Covariates in a linear regression in SAS

I am running a linear regression in SAS. The main predictor is a three-level participant group variable (with group 1 as the reference), the outcome is a continuous social support variable, and there are five covariates. Three of the covariates are dichotomized (age, sex, and education), one is a three-level nominal variable (marital status), and the last is continuous (a chronic disease index).
My question is: do I need to specify the different types of covariates in the SAS code somehow?
Would this code be correct?
proc glm data=work.example;
class group age sex education marital chronic_diseases;
model social_support = group age sex education marital chronic_diseases;
estimate 'group 1 vs group 2' group -1 1 0;
estimate 'group 1 vs group 3' group -1 0 1;
run;

The class statement tells SAS to treat a variable as non-continuous: that is, categorical or binary. It doesn't differentiate between the two. Each distinct value becomes a level, the levels are sorted in ascending order, and PROC GLM uses the last level in that order as the reference unless you specify a reference group (in recent SAS/STAT releases, with something like class group(ref='1')). Your estimate statements compare each group to group 1 explicitly, so they do not depend on that default.
For example, if you're comparing Apples and Oranges, PROC GLM will use Oranges as the reference value by default. Hey, they're fruit - you can compare fruit to fruit! :)
All model covariates are treated as continuous unless they are listed in a class statement. Since chronic_diseases is continuous, simply remove it from the class statement; otherwise, SAS will treat every distinct value of chronic_diseases as a separate level and compare each of them to a single reference level.
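proc glm data=work.example;
/* categorical predictors only; chronic_diseases is continuous, so it is left out */
class group age sex education marital;
model social_support = group age sex education marital chronic_diseases;
/* each estimate is the named group minus group 1, adjusted for the covariates */
estimate 'group 1 vs group 2' group -1 1 0;
estimate 'group 1 vs group 3' group -1 0 1;
run;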
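To see why it matters whether chronic_diseases is in the class statement, here is a rough sketch in Python rather than SAS. The data values are made up and n_parameters is a hypothetical helper, not anything SAS provides. It only counts how many non-redundant parameters (besides the intercept) each specification would ask for, assuming one reference level per class variable.

```python
# Toy data: variable names follow the question, values are invented.
rows = [
    {"group": 1, "marital": "single", "chronic_diseases": 0.0},
    {"group": 2, "marital": "married", "chronic_diseases": 1.5},
    {"group": 3, "marital": "widowed", "chronic_diseases": 2.0},
    {"group": 1, "marital": "married", "chronic_diseases": 3.5},
    {"group": 2, "marital": "single", "chronic_diseases": 4.0},
]


def n_parameters(rows, class_vars, continuous_vars):
    """Count non-redundant parameters besides the intercept (hypothetical helper)."""
    total = 0
    for var in class_vars:
        levels = {row[var] for row in rows}
        total += len(levels) - 1  # one level per class variable is the reference
    total += len(continuous_vars)  # one slope per continuous covariate
    return total


# chronic_diseases as a continuous covariate: a single slope
as_continuous = n_parameters(rows, ["group", "marital"], ["chronic_diseases"])
# chronic_diseases as a class variable: one level per distinct value
as_class = n_parameters(rows, ["group", "marital", "chronic_diseases"], [])

print(as_continuous, as_class)
```

With real data a continuous index can take many distinct values, so the class version can use up a large share of the degrees of freedom.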

Related

Combining surveys with distinct analytical weights in Stata

I have a dataset that combines 14 household surveys from 14 countries. The surveys were conducted in different years, and each has a household weight variable that is specific to its own country's context (the data structure is the same across all 14 countries).
I merged them and tried to cross-tabulate country against gender_area (four values: male_rural, female_rural, male_urban, female_urban) with weights (tab country gender [aw=hhweight], m). But this cross-tabulation produces odd values for some of the countries.
For example, if I add an if condition at the end of the command (tab country gender [aw=hhweight] if abc==1, m), the row totals for some countries (KHM, NPL) are greater than their row totals without the condition, even though a condition should only ever give a smaller subsample. If I don't add the weight (tab country gender, m), the problem does not occur. It does not occur either if I tabulate a single country with the weight.
So I wonder whether there is any way to compare all countries using the weights. I am not very familiar with the survey data commands in Stata (svyset, strata, etc.).
I consulted the book Applied Survey Data Analysis, but it does not seem to cover how to handle this kind of combination.

Identify groups with few observations in panel-data models (Stata)

How can I identify groups with few observations in panel-data models?
I estimated several random-effects models using xtlogit. On average I have 26 observations per group, but some groups have only 1 observation. I want to identify those groups and exclude them from the models. Any suggestions on how to do this?
My panel data is set using: xtset countrycode year
Let's suppose your magic number for a big enough panel is 7 and that you fit a first model.
bysort countrycode : egen n_used = total(e(sample))
then gives you a count of how many observations were available and can be used, after which your criterion for a later model is if n_used >= 7
You could just go
bysort countrycode : gen n_available = _N
regardless of a model fit.
The two approaches differ in two ways:
The last statement counts every observation in a panel, including observations with missing values on variables used in the model fit, which the estimation would drop.
If you also used if and/or in to restrict the model fit to particular subsets of observations, then e(sample) knows about that, but the last statement does not.
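The gap between the two counts can be illustrated outside Stata with a small Python sketch. The records are invented, None stands in for a missing value, and the cutoff of 2 is chosen only to suit the toy data. It mimics the logic of n_available and n_used; it does not reproduce how xtlogit builds e(sample).

```python
from collections import Counter

# Hypothetical records: (countrycode, year, outcome, covariate); None marks missing.
records = [
    ("AAA", 2000, 1, 0.5),
    ("AAA", 2001, 0, None),  # would be dropped from the model fit
    ("AAA", 2002, 1, 0.7),
    ("BBB", 2000, 0, 0.1),
    ("CCC", 2000, 1, None),  # the only observation for CCC is unusable
]

# Analogue of: bysort countrycode : gen n_available = _N
n_available = Counter(rec[0] for rec in records)

# Analogue of: bysort countrycode : egen n_used = total(e(sample))
# Here "in the estimation sample" simply means "no missing model variables".
n_used = Counter(rec[0] for rec in records if None not in rec[2:])

min_obs = 2  # illustrative cutoff; the answer above uses 7
keep = sorted(code for code in n_available if n_used[code] >= min_obs)

print(dict(n_available), dict(n_used), keep)
```

CCC has one available observation but none that the model can use, which is exactly the case the e(sample) version catches and the _N version misses.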

Count unique patients and overall observation using PROC SQL

I'm working in SAS, using some SQL code to count the number of unique patients as well as the total number of observations for a set of indicators. Each record has a patient identifier, the facility where the patient is, and a group of binary indicators (0,1), one for each bed section (the particular place in the hospital where the patient is). For each patient record, only 1 bed section can have a value of '1'. Overall, patients can have multiple observations in one bed section or in other bed sections, i.e. patients can be hospitalized more than once. The idea is to roll this data set up by facility and count the total number of admissions for each bed section, but also the total number of people for each bed section. The people count will always be less than or equal to the observation count. Counting people was just added to my to-do list; up to this point I was only summing up observations for each bed section using the code below:
proc sql;
create table fac_bedsect as
select facility,
sum(bedsect_alc) as bedsect_alc,
sum(bedsect_blind) as bedsect_blind,
sum(bedsect_gen) as bedsect_gen
from bedsect_type
group by facility;
quit;
Is there a way I can incorporate into this code the # of unique people for each bed section? Thanks.
With no knowledge of the source table(s) it is impossible to answer precisely, but the syntax for counting distinct values is as seen below. You will need to use the correct column name where I have used "patient_id":
SELECT
facility
, COUNT(DISTINCT patient_id) AS patient_count
, SUM(bedsect_alc) AS bedsect_alc
, SUM(bedsect_blind) AS bedsect_blind
, SUM(bedsect_gen) AS bedsect_gen
FROM bedsect_type
GROUP BY
facility
;
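Note that the query above counts distinct patients per facility. If the goal is distinct patients per bed section within each facility, one common pattern is to put a CASE expression inside COUNT(DISTINCT ...). The sketch below runs that SQL against an in-memory SQLite table from Python so the result can be checked. The rows are made up and patient_id is an assumed column name. The same expression should carry over to PROC SQL, though that is not tested here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bedsect_type ("
    "patient_id INTEGER, facility TEXT, "
    "bedsect_alc INTEGER, bedsect_blind INTEGER, bedsect_gen INTEGER)"
)
# Invented rows: patient 1 is admitted to the alc section twice.
conn.executemany(
    "INSERT INTO bedsect_type VALUES (?, ?, ?, ?, ?)",
    [
        (1, "A", 1, 0, 0),
        (1, "A", 1, 0, 0),
        (1, "A", 0, 0, 1),
        (2, "A", 0, 0, 1),
        (3, "B", 0, 1, 0),
    ],
)

query = """
SELECT facility,
       SUM(bedsect_alc) AS bedsect_alc,
       COUNT(DISTINCT CASE WHEN bedsect_alc = 1 THEN patient_id END) AS patients_alc,
       SUM(bedsect_blind) AS bedsect_blind,
       COUNT(DISTINCT CASE WHEN bedsect_blind = 1 THEN patient_id END) AS patients_blind,
       SUM(bedsect_gen) AS bedsect_gen,
       COUNT(DISTINCT CASE WHEN bedsect_gen = 1 THEN patient_id END) AS patients_gen
FROM bedsect_type
GROUP BY facility
ORDER BY facility
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The CASE expression yields a missing value for rows outside the bed section, and COUNT(DISTINCT ...) ignores missing values, so each patient is counted at most once per section.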

SAS: Multiple patient diagnoses on multiple lines

I have a dataset of patient diagnoses with one diagnosis code per line, resulting in patient diagnoses on multiple lines. Each patient has a unique patientID. I also have age, race, gender, etc. data on these patients.
How do I indicate to SAS when using PROC FREQ, Logistic, Univariate, etc. that they are the same patient?
This is an example of what the data looks like:
patientID  diagnosis  age  gender  lab
1          15.02      65   M       positive
1          250.2      65   M       positive
2          348.2      23   M       negative
2          282.1      23   M       negative
3                     50   F       positive
I was given data on every patient who has had a certain lab (regardless of positive result), as well as all of their diagnoses, which each appear on a different line (as a different observation to SAS). First, I will need to exclude every patient who has a negative result for the lab, which I plan on using an IF statement for. The lab determines if the patient has disease X. Some patients do not have any additional diseases, other than disease X, such as patient #3.
Analyses I would like to perform:
Calculate the frequency of each disease using PROC FREQ.
Characterize the age and race relationships for each diagnosis using PROC FREQ chi square.
PROC LOGISTIC to determine risk factors (age, race, gender, etc.) for developing an additional disease on top of disease X.
Thanks!
The answer to your question is that you cannot, by default. But when you're processing the data you can account for it easily. IMO keeping it long is easier.
You've asked too many questions above, so I'll answer just one: how to count the number of people with each disease.
/* keep one row per patient-diagnosis pair */
proc sort data=have out=unique_disease_patient nodupkey;
by patientID diagnosis;
run;
/* count patients per diagnosis */
proc freq data=unique_disease_patient noprint;
table diagnosis / out=disease_patient_count;
run;
Note that this is much easier in SQL:
proc sql;
create table want as
select diagnosis, count(distinct patientID) as n_patients
from have
group by diagnosis;
quit;
I'm assuming this is homework because you're unlikely to do this in practice except for exploratory analysis.
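The deduplicate-then-count logic can be checked on the sample data with a short Python sketch. The rows follow the example table in the question, plus one repeated diagnosis line that is added here purely as a hypothetical duplicate. The filter on a positive lab result comes from the question's plan, not from the SAS code above.

```python
from collections import Counter

# (patientID, diagnosis, lab); None marks a patient with no additional diagnosis.
rows = [
    (1, "15.02", "positive"),
    (1, "250.2", "positive"),
    (1, "250.2", "positive"),  # hypothetical duplicate line, not in the question
    (2, "348.2", "negative"),
    (2, "282.1", "negative"),
    (3, None, "positive"),
]

# Keep only patients with a positive lab result.
positive = [(pid, dx) for pid, dx, lab in rows if lab == "positive"]

# One entry per patient-diagnosis pair, like proc sort with nodupkey.
unique_pairs = {(pid, dx) for pid, dx in positive if dx is not None}

# Patients per diagnosis, like the proc freq step on the deduplicated data.
patients_per_dx = Counter(dx for _, dx in unique_pairs)

# Distinct patients overall, the denominator for any proportion.
n_patients = len({pid for pid, _ in positive})

print(dict(patients_per_dx), n_patients)
```

Patient 3 has no diagnosis rows to count but still belongs in the denominator, which is why the patient total is computed separately from the per-diagnosis counts.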

EU-SILC database about education and experience

I am using the EU-SILC database for 2008 for Greece. First, I would like to use PE040 to create three dummies: primeduc for pre-primary and primary education; seceduc for lower secondary, (upper) secondary, and post-secondary non-tertiary education; and tereduc for the first and second stages of tertiary education.
Second, I would like to create a work experience variable based on the idea exper=age-educ-6, where educ would be something like the number of years (generally) spent in education.
Any ideas which Stata commands I should use?
What I've tried so far
About stata syntax:
tabulate PE040, gen(educ)
gen primeduc=educ1+educ2
gen seceduc=educ3+educ4+educ5
gen tereduc=educ6
Having defined lnwage as log(PY010N/(PL060+PL070)) and age as 2008-PB140, I tried running the regression, but it uses only 191 observations.
For your first question, I think you want a 0-1 indicator, equal to 1 if either of the indicated educational categories was recorded.
gen primeduc = educ1 | educ2
gen seceduc = educ3 | educ4 | educ5
The "|" stands for logical "or". For example, primeduc will be 1 if educ1 is 1 or educ2 is 1.
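As a quick check of the logic outside Stata, here is a minimal Python sketch. collapse_education is a hypothetical helper, and it assumes six 0/1 dummies (educ1 to educ6, as produced by tabulate with gen(educ)) of which exactly one is 1 for each person.

```python
def collapse_education(educ):
    """Collapse six mutually exclusive 0/1 dummies into three indicators.

    educ is a tuple (educ1, ..., educ6); the function name is made up
    for this illustration.
    """
    educ1, educ2, educ3, educ4, educ5, educ6 = educ
    primeduc = int(educ1 or educ2)           # 1 if either category was recorded
    seceduc = int(educ3 or educ4 or educ5)
    tereduc = int(educ6)
    return primeduc, seceduc, tereduc


primary = collapse_education((0, 1, 0, 0, 0, 0))
secondary = collapse_education((0, 0, 0, 1, 0, 0))
tertiary = collapse_education((0, 0, 0, 0, 0, 1))

print(primary, secondary, tertiary)
```

Because the dummies are mutually exclusive, adding them gives the same 0/1 values as the logical "or" in this sketch.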