How can we calculate the standardized β that will be used to contrast the impact of an independent variable on different dependent variables, using a PROC MIXED repeated-measures model?
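One common approach, shown here only as a minimal sketch under assumed names (the long-format dataset LONG and the variables SUBJECT, TIME, Y and X are hypothetical): standardize the outcome and predictor first, then fit the repeated-measures model; the fixed-effect estimate for the standardized predictor is then a standardized β.

proc standard data=long mean=0 std=1 out=long_std;
   var y x;                                  /* standardize outcome and predictor */
run;

proc mixed data=long_std;
   class subject time;
   model y = x / solution;                   /* SOLUTION prints the fixed-effect estimates */
   repeated time / subject=subject type=cs;  /* repeated-measures covariance structure */
run;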
SAS provides a handy tool for handling GLM models with a large number of groups or fixed effects: the ABSORB statement in PROC GLM. My understanding is that it factors the effect of the absorbed parameters out of the data before estimating the remaining parameters. It seems like this may invalidate the standard errors of the parameter estimates, since the variance-covariance matrix would not be calculated with the full set of parameters. Hence, we would be calculating the variance-covariance matrix conditional on the absorbed parameters, when we really want the unconditional variance. Am I misunderstanding what is going on here? I haven't found any good explanations in the help files.
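For reference, a minimal sketch of how ABSORB is typically used (the dataset PANEL and the variables FIRM_ID, YEAR, Y, X1 and X2 are hypothetical; note that the data must be sorted by the absorbed variable):

proc sort data=panel;
   by firm_id;                      /* ABSORB requires the data sorted by the absorbed variable */
run;

proc glm data=panel;
   absorb firm_id;                  /* firm effects are swept out before estimation */
   class year;
   model y = year x1 x2 / solution;
run;
quit;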
I have done my best to search the web for an answer to my question, but haven't been able to find one. Maybe I'm not asking in the right way, or maybe my problem can't be solved... Well, here goes nothing!
When running a regression in SAS, it is possible to do backward or forward selection and thereby eliminate all insignificant variables, which is great. But just because the p-value of a variable is ≤ 0.05, that doesn't necessarily mean the result is correct.
E.g., I run a regression in SAS with the dependent variable being the number of deaths due to illness and the independent variable being the number of doctors. The result is significant with p ≤ 0.05, but the coefficient says that as the number of doctors rises, the number of deaths also goes up. This would probably be a spurious regression with the causality pointing the wrong way, but SAS is only a computer and doesn't know which way the causality should go. (Of course it could also be true that more doctors = more deaths due to some other factor, but let us ignore that for now.)
My question is: Is it possible to run a regression and then tell SAS that it must do backward/forward elimination, but that, according to some rules I set, it also has to exclude variables that don't meet these rules? E.g., if deaths go up as the number of doctors increases, exclude the variable number of doctors? And what would that look like?
I really hope that someone can help me, because I am running a regression for many different years with more than 50 variables, and it would be great if I didn't have to go through all the results myself.
Thanks :)
I don't think this is possible or recommended. As mentioned, SAS is a computer and can't know which regression results are spurious. What if more doctors = more medical procedures = more deaths? Obviously you need to apply expert opinion to each situation, but the above scenario is just as plausible.
You also mention 'share of docs', which isn't the actual number, if I'm correct? So it may also be an artifact of how this metric is calculated.
If you have a specific set of exclusion rules you want to apply, that may be possible. But you first have to define all those rules and have a level of certainty about them.
If you need to specify unusual parameter selection criteria you could always roll your own machine learning by brute force: partition the data, run different regression models over all the partitions in macro loops, and use something like AIC to select the best model.
However, unless you are a machine learning expert it's probably best to start with something like proc glmselect.
SAS can do both forward selection and backward elimination in the GLMSELECT procedure, e.g.:
proc glmselect data=...;
model ... / selection=forward;
...
It would also be possible to combine both approaches - i.e. run several iterations of proc glmselect in macro loops, each with different model specifications, and then choose the best result.
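For instance, a minimal sketch (the dataset WORK.ANALYSIS, the response DEATHS and the candidate predictors X1-X50 are hypothetical): backward elimination by significance level, with the final model chosen by AIC among the models visited. Note that GLMSELECT does not enforce the sign restriction asked about above; you would still have to check the coefficient signs afterwards, e.g. in the parameter estimates table.

proc glmselect data=work.analysis;
   ods output ParameterEstimates=pe;    /* coefficients of the selected model, for sign checks */
   model deaths = x1-x50
         / selection=backward(select=SL sls=0.05 choose=AIC);
run;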
I've been asked to run a model using gradient boosting or random forest. So far so good; however, the only output that comes back in terms of variable importance is based on the number of times a variable was used as a branch rule. I've now been asked to basically get coefficients or somehow quantify the impact that the variables have on the target.
Is there a way to do this with a gradient boosting model? My other thought was to take only the variables that were shown to be used as branch rules and put them into a regular decision tree, or into a GLM or regular regression model.
Any help or ideas would be appreciated!! Thanks so much!
Just to make certain there is no misunderstanding: the SAS implementation of decision trees/gradient boosting (at least in Enterprise Miner) uses split-based variable importance.
Split-based importance does NOT count the number of splits made.
It is the ratio of the reduction in sum-of-squares achieved by one variable (specifically, the sum over all splits on this variable) to the reduction in sum-of-squares achieved by all splits in the model.
If you are using surrogate rules, highly correlated variables will receive roughly the same value.
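Stated as a formula (just restating the definition above, not quoting SAS documentation), the importance of a variable $v$ is

\[
\mathrm{Importance}(v) = \frac{\sum_{s \,\in\, \mathrm{splits\ on\ } v} \Delta SS_s}{\sum_{s \,\in\, \mathrm{all\ splits}} \Delta SS_s}
\]

where $\Delta SS_s$ is the reduction in the sum of squares achieved by split $s$.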
I am a beginner in data mining. I am using Weka. The data set has 109 variables, many of which are nominal variables with many levels (1 to 8). My question is:
1. Should I convert the categorical variables (with up to 8 levels) to binary, or use them as they are?
Note: I'll be using logistic regression, random forest and the naive Bayes algorithm.
They should work as is, but you may have different results if you pre-process the categorical data into binary.
Logistic regression, random forests and naive Bayes appear to handle nominal values quite well in Weka. Some of these models may behave differently under the hood if you convert the attributes to binary. I don't think it would make much of a difference for logistic regression, but I'm not so sure about random forest or naive Bayes.
I want to merge two datasets, but without using the MERGE statement or PROC SQL. Can I do this?
Is there any way to do the same?
Yes, there is: a join using hash tables (see the sketch after the pros and cons below).
See this document for an example: http://www.nesug.org/proceedings/nesug06/dm/da07.pdf
Advantages:
Faster in some cases
Disadvantages:
One of the tables needs to fit in memory
Syntax is very un-SAS-like (being closer to languages such as Java)
Not everybody is familiar with the concept, certainly not novice SAS users (this can be a problem for maintenance)
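To make the concept concrete, here is a minimal sketch of a hash-object lookup join (all names are hypothetical: SMALL with key ID and attribute REGION, BIG with ID):

data joined;
   if _n_ = 1 then do;
      if 0 then set small;                  /* make ID/REGION known to the PDV */
      declare hash h(dataset: "small");     /* load the small table into memory */
      h.defineKey("id");
      h.defineData("region");
      h.defineDone();
   end;
   set big;
   if h.find() = 0 then output;             /* keep rows of BIG with a match (inner join) */
run;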
In my opinion, a hash join is only useful in a very limited set of use cases. One example is where the following two conditions are met:
You need to join information from a huge table with a small (easily fits in memory) table.
The large table is not sorted on the join variables and there is no added value in having it sorted.
When the small table gets extremely small (e.g. only 10 key values), I would maybe consider some approach using 2 macro variables and 2 arrays. This is because the code would be just as performant, and easier to recognize as SAS by other people who might come after me and need to maintain it.
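A sketch of what that could look like (hypothetical: SMALL has a numeric key ID, a character value REGION, and exactly 10 rows as in the example above):

proc sql noprint;
   select id, quote(strip(region))
      into :keylist separated by ' ', :vallist separated by ' '
      from small;
quit;

data joined;
   set big;
   array keys[10]     _temporary_ (&keylist);   /* the 10 numeric key values        */
   array vals[10] $40 _temporary_ (&vallist);   /* matching quoted character values */
   length region $40;
   _i = whichn(id, of keys[*]);                 /* position of ID among the keys    */
   if _i > 0 then region = vals[_i];
   drop _i;
run;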
Conclusion: judging from the way the question is framed, you should go for a SAS data step merge or a proc sql join.
Although not a merge in the truest sense, if you have one large dataset and one small dataset, you can read the smaller dataset in as a format (via proc format) and then use the put function to map the values onto the larger one.
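A minimal sketch of that format-based lookup (hypothetical: SMALL has a character key ID and a value REGION, BIG has ID):

data fmt;
   set small(rename=(id=start region=label));
   retain fmtname '$regfmt' type 'C';     /* build a character format named $REGFMT */
run;

proc format cntlin=fmt;                   /* create the format from the small dataset */
run;

data joined;
   set big;
   length region $40;
   region = put(id, $regfmt.);            /* "merge" by formatting the key */
run;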
There are five ways in total to merge two datasets:
Proc SQL
Data Merge
Proc Format
Arrays
Hash Objects
Selection of the process depends on the size of the datasets and the primary key used.
Proc Format and hash objects have proved to be the best methods for merging large datasets with shorter run times.