Cleaner way of handling addition of summarizing rows to table? - sas

I have a dataset that is unique by 5 variables, with two dependent variables. My goal is for this dataset to have appended to it additional rows with TOTAL as the value of independent variables, with the values of the dependent variables changing accordingly.
Doing this for a single independent variable is not a problem; I would do something along the lines of:
proc sql;
create table want as
select "TOTAL" as independent_var1,
independent_var2,
...
independent_var5,
sum(dependent_1) as dependent_1,
sum(dependent_2) as dependent_2
from have
group by independent_var2,...,independent_var5;
quit;
This would be followed by appending the original dataset in whatever fashion you choose. However, I want the above five times over (once for each independent variable), and then again for each possible combination of TOTAL/non-total across the 5 independent variables. I'm not sure just how many datasets that is off the top of my head, but it's a decent amount.
So the best strategy I've come up with so far is to use the above with some mildly creative macro code to generate all possible table combinations of total/non-total, but it seems like SAS might have a better way, maybe tucked away in an esoteric proc step I've never heard of...
--
Attempt to show example, using three independent variables and 1 dependent variable:
Ind1 Ind2 Ind3 Dependent1
0 0 0 1
0 0 1 3
0 1 0 5
0 1 1 7
Desired output would be:
0 0 ALL 4
0 1 ALL 12
0 ALL 0 6
0 ALL 1 10
ALL 0 0 1
ALL 0 1 3
ALL 1 0 5
ALL 1 1 7
0 ALL ALL 16
ALL 0 ALL 4
ALL 1 ALL 12
ALL ALL 0 6
ALL ALL 1 10
ALL ALL ALL 16
0 0 0 1
0 0 1 3
0 1 0 5
0 1 1 7
I may have forgotten some combinations, but that should serve to get the point across.

PROC MEANS should do this for you trivially. You will need to clean up the output to match exactly what you want (the class variables are missing, rather than "ALL", in the summary rows), but otherwise it does the calculations correctly.
data have;
input Ind1 Ind2 Ind3 Dependent1;
datalines;
0 0 0 1
0 0 1 3
0 1 0 5
0 1 1 7
;;;;
run;
proc means data=have;
class ind1 ind2 ind3;
var dependent1;
output out=want sum=;
run;
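As a cross-check on the expected numbers, independent of SAS, here is a minimal Python sketch that enumerates every ALL/non-ALL combination for the three-variable example. It is not how PROC MEANS works internally; the function name `summarize` and the "ALL" label are illustrative assumptions. With n class variables there are 2^n groupings, so 8 here and 32 for the five-variable case.

```python
from itertools import product
from collections import defaultdict

# Sample data from the question: (Ind1, Ind2, Ind3, Dependent1)
rows = [(0, 0, 0, 1), (0, 0, 1, 3), (0, 1, 0, 5), (0, 1, 1, 7)]

def summarize(rows, n_keys):
    """Sum the dependent value for every ALL/non-ALL pattern of the keys."""
    totals = defaultdict(int)
    # Each mask is one grouping: True means "collapse this key to ALL".
    for mask in product([False, True], repeat=n_keys):
        for row in rows:
            key = tuple("ALL" if m else v for m, v in zip(mask, row[:n_keys]))
            totals[key] += row[n_keys]
    return dict(totals)

result = summarize(rows, 3)
# Sort on the string form because keys mix integers and "ALL".
for key, total in sorted(result.items(), key=lambda kv: str(kv[0])):
    print(*key, total)
```

For this sample the sketch yields 18 distinct rows, which can be compared line by line against the desired output listed in the question.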

Merge rows with unique ID in stata

I have a dataset keyed by unique county FIPS codes, some of which need to be merged. The dataset looks like:
FIPS yr1990 yr2000 yr2010
1001 1 0 1
1002 1 1 0
1003 1 0 0
1004 0 0 0
1005 0 0 1
County boundaries have changed and I need to merge several FIPS codes together. Essentially, I need the dataset to look like:
FIPS yr1990 yr2000 yr2010
1001/1003 1 1 1
1002 1 1 0
1004/1005 0 0 1
Is there a way to select specific FIPS to be merged over rows?
This solution might not scale to very large datasets as writing the replace statements must be done manually. But it keeps the exact format you are using in your example. And a more scalable way might be difficult if there is no system in how the FIPS codes were combined.
* Example generated by -dataex-. For more info, type help dataex
clear
input str4 FIPS byte(yr1990 yr2000 yr2010)
"1001" 1 0 1
"1002" 1 1 0
"1003" 1 0 0
"1004" 0 0 0
"1005" 0 0 1
end
*Combine the FIPS codes
replace FIPS = "1001/1003" if inlist(FIPS,"1001","1003")
replace FIPS = "1004/1005" if inlist(FIPS,"1004","1005")
*Collapse rows by FIPS value, use max value for each var on format yr????
collapse (max) yr???? , by(FIPS)
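For readers who want to check the relabel-then-collapse logic outside Stata, here is a rough Python sketch of the same two steps. The names `merged_label` and `collapsed` are made up for illustration. Note that with the sample rows as posted, yr2000 for 1001/1003 comes out as 0 under a max rule, since both source rows hold 0 there.

```python
# Sample data from the question: FIPS -> (yr1990, yr2000, yr2010)
data = {
    "1001": (1, 0, 1),
    "1002": (1, 1, 0),
    "1003": (1, 0, 0),
    "1004": (0, 0, 0),
    "1005": (0, 0, 1),
}

# Hand-written mapping of old codes to combined codes (the same role as the
# replace statements); codes not listed keep their own label.
merged_label = {
    "1001": "1001/1003", "1003": "1001/1003",
    "1004": "1004/1005", "1005": "1004/1005",
}

collapsed = {}
for fips, values in data.items():
    label = merged_label.get(fips, fips)
    if label in collapsed:
        # Element-wise maximum across the rows sharing a label.
        collapsed[label] = tuple(
            max(a, b) for a, b in zip(collapsed[label], values)
        )
    else:
        collapsed[label] = values

for label in sorted(collapsed):
    print(label, *collapsed[label])
```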

Giving subjects a binary id they keep for every period

In Stata I have a list of subjects and contributions from an economic experiment.
There are multiple rounds played for each treatment. I want to keep track of who contributed in the first period, giving them 1 if they were a contributor and 0 if they were a defector. The game is played for multiple periods, but I only really care about the first round. My current code looks like this:
g firstroundcont = 0
replace firstroundcont = 1 if c>0 & period==1
This, however, leaves everyone with a 0 in every subsequent period, so they are not identified as either a first-round contributor or a defector in the other periods of the dataset. The table below shows a snippet of my data and how the variable firstroundcont should look.
sessionID period subject group contribution firstroundcont
1 1 1 1 4 1
1 1 2 1 0 0
1 1 3 1 2 1
1 1 4 2 10 1
1 1 5 2 0 0
1 1 6 2 0 0
1 2 1 1 0 1
1 2 2 1 5 0
1 2 3 1 0 1
@JR96 is right: this sorely and surely needs a data example. But I guess you want something with the flavour of
bysort id (period) : gen wanted = c[1] > 0
See https://www.stata.com/support/faqs/data-management/creating-dummy-variables/ and https://www.stata-journal.com/article.html?article=dm0099 for more on how to get indicators in one step. The business of generating with 0 and then replacing with 1 can usually be cut to a direct one-line statement.
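To make the intent of the one-liner concrete, here is a small Python sketch of the same idea: look up each subject's period-1 contribution, then copy the resulting 0/1 flag onto every row for that subject. It assumes a single session and that `subject` plays the role of `id`; the names `first_round` and `flagged` are illustrative only.

```python
# Rows from the question's table: (period, subject, contribution)
rows = [
    (1, 1, 4), (1, 2, 0), (1, 3, 2), (1, 4, 10), (1, 5, 0), (1, 6, 0),
    (2, 1, 0), (2, 2, 5), (2, 3, 0),
]

# Step 1: look up each subject's period-1 contribution.
first_round = {subject: c for period, subject, c in rows if period == 1}

# Step 2: attach the same 0/1 flag to every row of that subject.
flagged = [
    (period, subject, c, int(first_round[subject] > 0))
    for period, subject, c in rows
]

for row in flagged:
    print(*row)
```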

Assigning a value to a certain number of rows within a "by" group - SAS

I've spent quite a lot of time on Stack Overflow looking for answers to other questions, but I'm really stuck on this one, so I'm finally asking a question!
I have a dataset of fish in SAS, with:
a unique ID for each angler
three different variables with number of fish released in each category by that angler: over legal size, under legal size, and released dead
a sequential number (fishno) based on the number of rows for each ID; 1 to the last row of that ID.
Variable to be created: Disposition--could be either character variable with "legal" "under" "dead" options or even numeric values of 1-3.
It was originally set up with one row per unique ID, but I set it so that now there is one row per fish discarded (i.e. if there were 3 legal size and 2 undersize fish, I now have 5 rows).
I need to assign, by unique ID, whether each row/fish was released legal, undersize or dead. In the previous example, for a unique ID, I'd need 3 rows assigned to a Disposition of "legal" and 2 rows assigned to a Disposition of "under".
I've tried first.var statements along with if-then-do statements; played around with macros; nothing worked quite right and I'm pretty stuck here. Is there some sort of random assignment I should try? Is there a much easier way that I'm missing?
Example of the data below...
THANK YOU!!
Data in Excel format
Assuming you already have the FISHNO variable, there needs to be some method for assigning each fish as legal, dead, or undersize. The following code will assign the disposition in that order:
data have;
input ID LEGAL DEAD UNDERSIZE FISHNO;
datalines;
15 1 0 1 1
15 1 0 1 2
29 2 0 2 1
29 2 0 2 2
29 2 0 2 3
29 2 0 2 4
38 1 0 1 1
38 1 0 1 2
53 1 0 1 1
53 1 0 1 2
55 1 0 1 1
55 1 0 1 2
;
run;
data want;
set have;
if legal>0 and legal>=fishno then disposition = 'legal';
else if dead>0 and legal+dead>=fishno then disposition = 'dead';
else if undersize>0 and legal+dead+undersize>=fishno then disposition = 'under';
run;
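The rule in the data step is a running-threshold comparison: the first `legal` fish are legal, the next `dead` fish are dead, and the rest are undersize. A minimal Python sketch of that rule follows; the function name `disposition` is hypothetical and the sketch handles one row at a time.

```python
def disposition(legal, dead, undersize, fishno):
    """Return the label for the fishno-th fish (1-based) of one angler."""
    if fishno <= legal:
        return "legal"
    if fishno <= legal + dead:
        return "dead"
    if fishno <= legal + dead + undersize:
        return "under"
    return None  # fishno exceeds the angler's total count

# ID 29 from the sample data: 2 legal, 0 dead, 2 undersize
labels = [disposition(2, 0, 2, n) for n in range(1, 5)]
print(labels)
```

For ID 29 the first two fish are labelled legal and the last two under, which matches what the SAS conditions produce for those four rows.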

Multiple conditions for same variable in SAS

I'm trying to detect specific values of one variable and create a new one if those conditions are fulfilled.
Here's a part of my data (I have many more rows):
id time result
1 1 normal
1 2 normal
1 3 abnormal
2 1 normal
2 2
3 3 normal
4 1 normal
4 2 normal
4 3 abnormal
5 1 normal
5 2 normal
5 3
What I want
id time result base
1 1 normal
1 2 normal x
1 3 abnormal
2 1 normal x
2 2
2 3 normal
3 3 normal
4 1 normal
4 2 normal x
4 3 abnormal
5 1 normal
5 2 normal x
5 3
My baseline value (base) should be populated when result exists at timepoint (time) 2. If there's no result then baseline should be at time=1.
if result="" and time=2 then do;
if time=10 and result ne "" then base=X; end;
if result ne "" and time=2 then base=X;
It works correctly when time=2 and a result exists. But if the result is missing, something goes wrong.
The question seems a bit off. "Else if time="" and time=1": there seems to be a typo there somewhere.
However, your syntax seems solid. I've worked an example with your given data. The first condition works, but the second (else if) is an assumption on my part. I'll update this as the question is updated.
options missing='';
data begin;
input id time result $ 5-20 ;
datalines;
1 1 normal
1 2 normal
1 3 abnormal
2 1 normal
2 2
3 3 normal
4 1 normal
4 2 normal
4 3 abnormal
;
run;
data flagged;
set begin;
if time=2 and result NE "" then base='X';
else if time=1 and id=2 then base='X';
run;
Edit based on revisited question.
This assumes that time point 1 is always adjacent to time point 2 (if not, add more lags). To simulate a LEAD function, we sort the data in descending order and use LAG.
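proc sort data=begin; by id descending time; run;
data flagged;
set begin;
/* Call LAG on every row so its queues stay in step with the data */
prev_id = lag(id);
prev_time = lag(time);
prev_result = lag(result);
/* After the descending sort, the previous row holds the later time point */
if time=2 and result NE "" then base='X';
else if id=prev_id and prev_time=2 and prev_result EQ "" then base='X';
drop prev_:;
run;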
More about opposite of lag: https://communities.sas.com/t5/SAS-Communities-Library/How-to-simulate-the-LEAD-function-opposite-of-LAG/ta-p/232151
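The selection rule itself (use time 2 when it has a result, otherwise fall back to time 1) can also be expressed with a direct lookup rather than a sort and lag. The Python sketch below is only an illustration of that rule on the sample rows; `result_at` and `is_baseline` are made-up names, and an empty string stands in for a missing result.

```python
# Sample rows from the data step above: (id, time, result); "" = missing
rows = [
    (1, 1, "normal"), (1, 2, "normal"), (1, 3, "abnormal"),
    (2, 1, "normal"), (2, 2, ""),
    (3, 3, "normal"),
    (4, 1, "normal"), (4, 2, "normal"), (4, 3, "abnormal"),
]

# Index results by (id, time) so the time-2 row can be inspected directly.
result_at = {(i, t): r for i, t, r in rows}

def is_baseline(i, t):
    """Flag time 2 when it has a result; otherwise flag time 1."""
    has_time2 = result_at.get((i, 2), "") != ""
    if t == 2:
        return has_time2
    if t == 1:
        # Time 1 is the fallback only when a time-2 row exists but is empty.
        return (i, 2) in result_at and not has_time2
    return False

base = [(i, t, "X" if is_baseline(i, t) else "") for i, t, _ in rows]
for row in base:
    print(*row)
```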

Logistic Time Discrete Hazard Model Parameter Estimates Interpretation

I am using PROC GLIMMIX, and I'm curious as to why my parameter estimates are behaving strangely.
proc glimmix data=blah pconv=1e-3;
class strata1;
model event(event=LAST)=time1--time20/
noint solution link=logit dist=binary;
nloptions tech=nrridg;
covtest 'var(strata1)=0'/WALD;
random intercept/subject=strata1;
run;
Since I'm using a logistic discrete time hazard model (without any censored observations), I have my dataset constructed in 'person-period' format. Here is an example of what a person-period dataset looks like:
id time1 time2 time3 time4 event
100 1 0 0 0 0
100 0 1 0 0 0
100 0 0 1 0 1
101 1 0 0 0 1
102 1 0 0 0 0
102 0 1 0 0 0
102 0 0 1 0 0
102 0 0 0 1 0
Essentially, each 'time' variable indicates whether that period is occurring. So time1=1 during the first period and 0 otherwise, time2=1 during the second period and 0 otherwise, and so on. I am modelling the probability that the event occurs during each of these periods. When I use PROC LOGISTIC, I get sensible parameter estimates.
proc logistic data=blah;
model event (event=LAST)=time1--time20 /noint;
run;
This code gives a parameter estimate for time1 of -3.0052, which corresponds to a probability of .047 that the event occurs in time period 1. The estimates slowly get smaller for each time[i] variable, which is what I would expect. However, when I run my GLIMMIX code with the random effect for strata1 added, the model blows up: the parameter estimates for time flip their sign. I get time1=2.84, time2=2.67, time3=2.41, and they continue to get smaller. I'm really confused as to why. This model is telling me that the probability of the event occurring is over 90% in this period, which I know to be untrue. Does anyone have any idea why this is? I would expect these estimates to have essentially the opposite (negative) sign.
Thanks.
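For reference, the probabilities quoted above follow from the inverse logit, p = 1 / (1 + exp(-x)). A short Python sketch of that arithmetic, using only the two estimates given in the question (the helper name `inv_logit` is illustrative):

```python
import math

def inv_logit(x):
    """Convert a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Estimates quoted in the question
p_logistic = inv_logit(-3.0052)  # PROC LOGISTIC, time1
p_glimmix = inv_logit(2.84)      # PROC GLIMMIX, time1
print(round(p_logistic, 3), round(p_glimmix, 3))
```

Because inv_logit(-x) = 1 - inv_logit(x), an estimate with the opposite sign describes the probability of the complementary outcome.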