Stata generate by groups - stata

I have a very weird thing happening with my code. I have panel data set with the panel id being p_id and I am trying to create a another variable by using panel_id. My code is this, where p_id is the panel id, marital_status of person observed in each time period and x is the variable I would want to create.
bys p_id: gen count =_N
bys p_id: gen count1 =_n
bys p_id: gen x= marital_status if count1 ==1
However when I do
tab x
I get different numbers for rows (row total does not change) each time I run this code. The numbers are pretty closely clustered, but I need to understand why this is happening.

Although the lack of a reproducible example is poor practice, it is possible to guess at what is going on. The first line of code is not problematic, but the second two have the same effect as
bys p_id: gen x = marital_status if _n == 1
In words, the new variable contains marital status data from the first observation in each group of observations for distinct p_id. But sorting on p_id says nothing about sort order for the observations with the same p_id and that within-group sort order is not reproducible without some sufficient constraint. So the first observation could easily be different (unless naturally there is only one observation in each group), with the results you report.
Concretely, suppose that there are 3 observations for p_id 42. Then any of 6 possible orders of those observations is consistent with sorting on p_id. And so forth.
Presumably there is something special about one observation in each group. You would need to explain more about your data and what you want to get to allow fuller advice, but this problem is not a puzzle.

Related

Ranking sum of data by two categories

I have data that I would like to rank by two separate categories, State and ServiceType. Essentially, there are multiple years of data for each ServiceType across various states, and I was hoping to get the sum of all years for each ServiceType by State, meaning each State is treated independently and the sums of the various categories are ranked only within that state, not nationally.
I've tried
bys State ServiceCategory (quant_variable): ///
egen rank_quant_variable= rank(sum(quant_variable)), field
as well as a version of above where I used a pre-calculated sum variable. Both don't really work.
This lacks a reproducible example, as you do not give your data or phrase your problem in terms of a dataset we could download, for example as loaded with or referred to in Stata. There is no need to give the full dataset but just a minimal example with the same structure.
The call to sum() here would be to Stata's sum() function, which yields the cumulative or running sum, which evidently isn't what you want. So that case is easy to dismiss.
The problem remaining is quite what you did in the code you don't show with a pre-calculated sum.
At a guess you worked out
bys State ServiceCategory: egen sum = total(quant_variable)
and then pushed that sum through rank(). But that would use each value of sum as many times as it occurred.
Perhaps you want something more like this:
egen tag = tag(State ServiceCategory)
bysort State: egen rank_quant_variable = rank(sum) if tag, field
bysort State (rank): replace rank = rank[1]
But it's really hard (for me) to visualize this without details on what you did or an example to work on.

Dealing with repetitive questions or text in a Proc Transpose

Hello I have this data
And I need to perform these edit checks
"If answer to “Did the subject consume the entire high fat/high calorie
breakfast?” is YES, then “If no, did the subject consume at least 50%
of the high fat/high calorie breakfast?” fasting question must not be
present"
"If answer to “Did the subject consume the entire high fat/high calorie
breakfast?” is NO, then “If no, did the subject consume at least 50%
of the high fat/high calorie breakfast?” fasting question must be
present"
"If answer to “Did the subject consume the entire high fat/high calorie
breakfast?” is NO, then answer to “If no, did the subject consume at
least 50% of the high fat/high calorie breakfast?” must be present"
I have written the following code
*data fq;
set dm.fq;
ptno=strip(compress(clientid,'-'))+0;
run;
proc sort;
by ptno period day hour;
run;
proc transpose data=fq out=tempfq (DROP=_NAME_ _LABEL_);
by ptno period day hour;
var fq_yn;
id fq_qst;
run;
data final;
set tempfq;*
However, I get the following error in the log
ERROR: The ID value "DID_THE_SUBJECT_FAST_AT_LEAST_4" occurs twice in the same BY group.
And the headers come out capitalised and truncated
How do I deal with the error in the log and how can I expand the column width to produce the whole question when transposing?
You will need do two things
determine what you want for repeated questions, such as:
for the case of occurs twice in the same BY group.
do you want 2 columns, or
one column from the first occurrence, or
one column from the second occurrence, or
one column with YES if both YES & NO occur, or
one column with NO if both YES & NO occur, or
one column with MULTI if both YES & NO occur, or
one column with YES, NO if they occur in that order per period, day, hour, or
one column with NO, YES if they occur in that order per period, day, hour
one column with NO, YES regardless of order
one column with YES, NO regardless of order
one column with (freq) suffix such as YES(2) or NO(2)
same ideas for cases of BY GROUP replicates of 3 or more
each idea will require some preprocessing of the data before TRANSPOSE
Use IDLABEL to have the whole original question in the column header when output.
Pivoting survey data may be useful for modeling and forecasting purposes, however, if done for reporting purposes you might be better off using Proc TABULATE or Proc REPORT.
Possible 'Fix'
When an ID value occurs more than once in a BY group there are more than one VAR values going into a single pivot destination. Hence the ERROR:
For the case of a question repeated and having the same answer within group ptno period day hour you can SORT by one key more, adding FQ_QST and specify option NODUPKEY. After such sorting no duplicates of FQ_QST occur, only a single FQ_YN value is being pivoted into a column via ID.
proc sort NODUPKEY;
by ptno period day hour FQ_QST;
run;
If you have data with repeated questions within group and the questions have different answers, the answer remaining per NODUPKEY is dependent on how SORT is being run. From help:
When the SORT procedure’s input is a Base SAS engine data set and the sorting is done by SAS, then the order of observations within an output BY group is predictable. The order of the observations within the group is the same as the order in which they were written to the data set when it was created. Because the Base SAS engine maintains observations in the order that they were written to the data set, they are read by PROC SORT in the same order. While processing, PROC SORT maintains the order of the observations because it uses a stable sorting algorithm. The stable sorting algorithm is used because the EQUALS option is set by default. Therefore, the observation that is selected by PROC SORT to be written to the output data set for a given BY group is the first observation in the data set having the BY variable values that define the group.

Stata: Newvar for multiple equal dates

I have trouble to generate a new variable which will be created for every month while having multiple entries for every month.
date1 x b
1925m12 .01213 .323
1925m12 .94323 .343
1926m01 .34343 .342
Code would look like this gen newvar = sum(x*b) but I want to create the variable for each month.
What I tried so far was
to create an index for the date1 variable with
sort date1
gen n=_n
and after that create a binary marker for when the date changes
with
gen byte new=date1!=date[[_n-1]
After that I received a value for every other month but I m not sure if this seems to be correct or not and thats why I would like someone have a look at this who could maybe confirm if that should be correct. The thing is as there are a lot of values its hard to control it manually if the numbers are correct. Hope its clear what I want to do.
Two comments on your code
There's a typo: date[[_n-1] should be date1[_n-1]
In your posted code there's no need for gen n = _n.
Maybe something along the lines of:
clear
set more off
*-----example data -----
input ///
str10 date1 x b
1925m12 .01213 .323
1925m12 .94323 .343
1926m01 .34343 .342
end
gen date2 = monthly(date1, "YM")
format %tm date2
*----- what you want -----
gen month = month(dofm(date2))
bysort month: gen newvar = sum(x*b)
list, sepby(month)
will help.
But, notice that the series of the cumulative sum can be different for each run due to the way in which Stata sorts and because month does not uniquely identify observations. That is, the last observation will always be the same, but the way in which you arrive at the sum, observation-by-observation, won't be. If you want the total, then use egen, total() instead of sum().
If you want to group by month/year, then you want: bysort date2: ...
The key here is the by: prefix. See, for example, Speaking Stata: How to move step by: step by Nick Cox, and of course, help by.
A major error is touched on in this thread which deserves its own answer.
As used with generate the function sum() returns cumulative or running sums.
As used with egen the function name sum() is an out-of-date but still legal and functioning name for the egen function total().
The word "function" is over-loaded here even within Stata. egen functions are those documented under egen and cannot be used in any other command or context. In contrast, Stata functions can be used in many places, although the most common uses are within calls to generate or display (and examples can be found even of uses within egen calls).
This use of the same name for different things is undoubtedly the source of confusion. In Stata 9, the egen function name sum() went undocumented in favour of total(), but difficulties are still possible through people guessing wrong or not studying the documentation really carefully.

Stata: using egen group() to create unique identifiers

I have a dataset where each row is a firm, year pair with a firmid that is a string.
If I do
duplicates drop firmid year, force
it doesn't delete anything since there are no duplicates (I originally created the dataset after running duplicates drop firmid year, force).
So far so good. I want to create a panel which requires a firmid that is numeric. So I run
egen newid = group(firmid)
xtset newid year
But the 'repeated time values in panel' error pops up. Moreover,
duplicates list newid year
lists a whole bunch of duplicates.
It seems as though egen, group() isn't generating unique groups. My question is: why, and how do I create unique groups in a robust way?
This is an old thread, but I have recently experienced the same symptoms, so I wanted to share my solution. Of course, so long as the questioner does not give further details, we will not know whether the causes are the same for me and him.
The problem turned out to be an issue of precision. As explained here in section 4.4, calculations done on integers stored as floats are precise only in the range up to 16,777,216. So, if you have more than 16,777,216 firms in your sample, rounding error will result in the same ID being assigned to multiple firms. This is straightforwardly dealt with by increasing the precision of the ID variable to long:
egen long newid = group(firmid)

New SAS variable conditional on observations

(first time posting)
I have a data set where I need to create a new variable (in SAS), based on meeting a condition related to another variable. So, the data contains three variables from a survey: Site, IDnumb (person), and Date. There can be multiple responses from different people but at the same site (see person 1 and 3 from site A).
Site IDnumb Date
a 1 6/12
b 2 3/4
c 4 5/1
a 3 .
d 5 .
I want to create a new variable called Complete, but it can't contain duplicates. So, when I go to proc freq, I want site A to be counted once, using the 6/12 Date of the Completed Survey. So basically, if a site is represented twice and contains a Date in one, I want to only count that one and ignore the duplicate site without a date.
N %
Complete 3 75%
Last Month 1 25%
My question may be around the NODUP and NODUPKEY possibilities. If I do a Proc Sort (nodupkey) by Site and Date, would that eliminate obs "a 3 ."?
Any help would be greatly appreciated. Sorry for the jumbled "table", as this is my first post (hints on making that better are also welcomed).
You can do this a number of ways.
First off, you need a complete/not complete binary variable. If you're in the datastep anyway, might as well just do it all there.
proc sort data=yourdata;
by site date descending;
run;
data yourdata_want;
set yourdata;
by site date descending;
if first.site then do;
comp = ifn(date>0,1,0);
output;
end;
run;
proc freq data=yourdata_want;
tables comp;
run;
If you used NODUPKEY, you'd first sort it by SITE DATE DESCENDING, then by SITE with NODUPKEY. That way the latest date is up top. You also could format COMP to have the text labels you list rather than just 1/0.
You can also do it with a format on DATE, so you can skip the data step (still need the sort/sort nodupkey). Format all nonmissing values of DATE to "Complete" and missing value of date to "Last Month", then include the missing option in your proc freq.
Finally, you could do the table in SQL (though getting two rows like that is a bit harder, you have to UNION two queries together).