Stata: using egen group() to create unique identifiers

Stata: using egen group() to create unique identifiers - stata

I have a dataset where each row is a firm, year pair with a firmid that is a string.
If I do
duplicates drop firmid year, force
it doesn't delete anything since there are no duplicates (I originally created the dataset after running duplicates drop firmid year, force).
So far so good. I want to create a panel which requires a firmid that is numeric. So I run
egen newid = group(firmid)
xtset newid year
But the 'repeated time values in panel' error pops up. Moreover,
duplicates list newid year
lists a whole bunch of duplicates.
It seems as though egen, group() isn't generating unique groups. My question is: why, and how do I create unique groups in a robust way?

This is an old thread, but I have recently experienced the same symptoms, so I wanted to share my solution. Of course, so long as the questioner does not give further details, we will not know whether the causes are the same for me and him.
The problem turned out to be an issue of precision. As explained here in section 4.4, calculations done on integers stored as floats are precise only in the range up to 16,777,216. So, if you have more than 16,777,216 firms in your sample, rounding error will result in the same ID being assigned to multiple firms. This is straightforwardly dealt with by increasing the precision of the ID variable to long:
egen long newid = group(firmid)

Related

Ranking sum of data by two categories

I have data that I would like to rank by two separate categories, State and ServiceType. Essentially, there are multiple years of data for each ServiceType across various states, and I was hoping to get the sum of all years for each ServiceType by State, meaning each State is treated independently and the sums of the various categories are ranked only within that state, not nationally.
I've tried
bys State ServiceCategory (quant_variable): ///
egen rank_quant_variable= rank(sum(quant_variable)), field
as well as a version of above where I used a pre-calculated sum variable. Both don't really work.

This lacks a reproducible example, as you do not give your data or phrase your problem in terms of a dataset we could download, for example as loaded with or referred to in Stata. There is no need to give the full dataset but just a minimal example with the same structure.
The call to sum() here would be to Stata's sum() function, which yields the cumulative or running sum, which evidently isn't what you want. So that case is easy to dismiss.
The problem remaining is quite what you did in the code you don't show with a pre-calculated sum.
At a guess you worked out
bys State ServiceCategory: egen sum = total(quant_variable)
and then pushed that sum through rank(). But that would use each value of sum as many times as it occurred.
Perhaps you want something more like this:
egen tag = tag(State ServiceCategory)
bysort State: egen rank_quant_variable = rank(sum) if tag, field
bysort State (rank): replace rank = rank[1]
But it's really hard (for me) to visualize this without details on what you did or an example to work on.

Clarification on tabstat use after bysort in Stata

I have a rather simple question regarding the output of tabstat command in Stata.
To be more specific, I have a large panel dataset containing several hundred thousands of observations, over a 9 year period.
The context:
bysort year industry: egen total_expenses=total(expenses)
This line should create total expenses by year and industry (or sum of all expenses by all id's in one particular year for one particular industry).
Then I'm using:
tabstat total_expenses, by(country)
As far as I understand, tabstat should show in a table format the means of expenses. Please do note that ids are different from countries.
In this case tabstat calculates the means for all 9 years for all industries for a particular country, or it just the mean of one year and one industry by each country from my panel data?
What would happen if this command is used in the following context:
bysort year industry: egen mean_expenses=mean(expenses)
tabstat mean_expenses, by(country)
Does tabstat creates means of means? This is a little bit confusing.

I don't know what is confusing you about what tabstat does, but you need to be clear about what calculating means implies. Your dataset is far too big to post here, but for your sake as well as ours creating a tiny sandbox dataset would help you see what is going on. You should experiment with examples where the correct answer (what you want) is obvious or at least easy to calculate.
As a detail, your explanation that ids are different from countries is itself confusing. My guess is that your data are on firms and the identifier concerned identifies the firm. Then you have aggregations by industry and by country and separately by year.
bysort year industry: egen total_expenses = total(expenses)
This does calculate totals and assigns them to every observation. Thus if there are 123 observations for industry A and 2013, there will be 123 identical values of the total in the new variable.
tabstat total_expenses, by(country)
The important detail is that tabstat by default calculates and shows a mean. It just works on all the observations available, unless you specify otherwise. Stata has no memory or understanding of how total_expenses was just calculated. The mean will take no account of different numbers in each (industry, year) combination. There is no selection of individual values for (industry, year) combinations.
Your final question really has the same flavour. What your command asks for is a brute force calculation using all available data. In effect your calculations are weighted by the numbers of observations in whatever combinations of industry, country and year are being aggregated.
I suspect that you need to learn about two commands (1) collapse and (2) egen, specifically its tag() function. If you are using Stata 16, frames may be useful to you. That should apply to any future reader of this using a later version.

Stata generate by groups

I have a very weird thing happening with my code. I have panel data set with the panel id being p_id and I am trying to create a another variable by using panel_id. My code is this, where p_id is the panel id, marital_status of person observed in each time period and x is the variable I would want to create.
bys p_id: gen count =_N
bys p_id: gen count1 =_n
bys p_id: gen x= marital_status if count1 ==1
However when I do
tab x
I get different numbers for rows (row total does not change) each time I run this code. The numbers are pretty closely clustered, but I need to understand why this is happening.

Although the lack of a reproducible example is poor practice, it is possible to guess at what is going on. The first line of code is not problematic, but the second two have the same effect as
bys p_id: gen x = marital_status if _n == 1
In words, the new variable contains marital status data from the first observation in each group of observations for distinct p_id. But sorting on p_id says nothing about sort order for the observations with the same p_id and that within-group sort order is not reproducible without some sufficient constraint. So the first observation could easily be different (unless naturally there is only one observation in each group), with the results you report.
Concretely, suppose that there are 3 observations for p_id 42. Then any of 6 possible orders of those observations is consistent with sorting on p_id. And so forth.
Presumably there is something special about one observation in each group. You would need to explain more about your data and what you want to get to allow fuller advice, but this problem is not a puzzle.

Stata: Newvar for multiple equal dates

I have trouble to generate a new variable which will be created for every month while having multiple entries for every month.
date1 x b
1925m12 .01213 .323
1925m12 .94323 .343
1926m01 .34343 .342
Code would look like this gen newvar = sum(x*b) but I want to create the variable for each month.
What I tried so far was
to create an index for the date1 variable with
sort date1
gen n=_n
and after that create a binary marker for when the date changes
with
gen byte new=date1!=date[[_n-1]
After that I received a value for every other month but I m not sure if this seems to be correct or not and thats why I would like someone have a look at this who could maybe confirm if that should be correct. The thing is as there are a lot of values its hard to control it manually if the numbers are correct. Hope its clear what I want to do.

Two comments on your code
There's a typo: date[[_n-1] should be date1[_n-1]
In your posted code there's no need for gen n = _n.
Maybe something along the lines of:
clear
set more off
*-----example data -----
input ///
str10 date1 x b
1925m12 .01213 .323
1925m12 .94323 .343
1926m01 .34343 .342
end
gen date2 = monthly(date1, "YM")
format %tm date2
*----- what you want -----
gen month = month(dofm(date2))
bysort month: gen newvar = sum(x*b)
list, sepby(month)
will help.
But, notice that the series of the cumulative sum can be different for each run due to the way in which Stata sorts and because month does not uniquely identify observations. That is, the last observation will always be the same, but the way in which you arrive at the sum, observation-by-observation, won't be. If you want the total, then use egen, total() instead of sum().
If you want to group by month/year, then you want: bysort date2: ...
The key here is the by: prefix. See, for example, Speaking Stata: How to move step by: step by Nick Cox, and of course, help by.

A major error is touched on in this thread which deserves its own answer.
As used with generate the function sum() returns cumulative or running sums.
As used with egen the function name sum() is an out-of-date but still legal and functioning name for the egen function total().
The word "function" is over-loaded here even within Stata. egen functions are those documented under egen and cannot be used in any other command or context. In contrast, Stata functions can be used in many places, although the most common uses are within calls to generate or display (and examples can be found even of uses within egen calls).
This use of the same name for different things is undoubtedly the source of confusion. In Stata 9, the egen function name sum() went undocumented in favour of total(), but difficulties are still possible through people guessing wrong or not studying the documentation really carefully.

How do I get a group minimum over level combinations of factors?

I would like to find minimum values within groups. In stata, I think it is simply "by group, sort : egen minvalue=min(value)"...
I tried to mess around with ave and rowsum, but to no avail.
ave(value, group, FUN=min) did not work.

Sorry this answer is a little late but, in case you are still looking for the answer or for future searchers here goes....
You are onto the right track with the -by- command. Here is what I'd do to find the lowest price of cars in the auto.dta dataset by domestic/foreign grouping.
sysuse auto, clear
bysort foreign : egen minprice = min(price)
What this does is create a new variable 'minprice' that holds the minimum price for domestic cars if a given car (observation) is domestic and vice versa for foreign cars. So this new variable with have just two values in this example and you can check that by doing:
tabulate minprice
Depending on why you wanted to find the minimum values by group this may not be what you had in mind but hopefully someone finds it helpful.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js