What is the difference between the DATALINES and CARDS statements in SAS?

I am seeking a job working with SAS. The interviewer asked me this question; I said there is no difference, both are used to read in-stream data, but he told me there is a difference between the CARDS and DATALINES statements.
Kindly help me crack this question.

There is no significant difference between the two. CARDS is defined as an alias of DATALINES, and by definition an alias should behave identically to the statement it aliases.
There is an insignificant difference, which is really just undefined behavior; it is explained on page 4 of Andrew Kuligowski's 2007 paper and relates to how CARDS and DATALINES give slightly different results when used in a FILE statement (which they are not intended to be used with). The behavior differs slightly in modern SAS (9.4): DATALINES now gives an error, but a different one from CARDS. Still, this is undefined behavior, and it would be absurd for an interviewer to expect it as an answer.
Your interviewer may have been referring to DATALINES versus DATALINES4, or CARDS versus CARDS4; those do differ, because the latter requires four semicolons to end the data block instead of one, to allow for semicolons in the data itself.
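For what it's worth, a minimal sketch of that difference (the data set and variable names here are made up for illustration):

data quotes;
   input quote $char60.;
   datalines4;
Veni; vidi; vici.
Carpe diem; quam minimum credula postero.
;;;;
run;

With plain DATALINES (or CARDS), the first semicolon inside the data would end the block prematurely.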

How to best calculate correlations for observations that are not independent of each other (in Stata)?

I have the following case. My data set consists of 78 policy documents (= my observations). These were written by 50 different country governments (in the period between 2005 and 2020). While 27 countries have written only one policy document, 23 countries have written multiple policy documents. In the latter case, these same-country different-policy documents have usually been written years apart by different governments/administrations and different ministries. Nevertheless, I reckon there is probably a risk that the observations are not independent. My overarching question is, therefore: How would you calculate correlations in this case? More specifically:
Pearson correlation assumes independence of the observations and is thus not suitable here, correct? Or could one even credibly argue that the observations are independent after all, since they were usually published many years (and therefore governments) apart and by different ministries?
Would "within-participants correlation" (Bland & Altman 1995 a & b) or "repeated measures correlation" (= RMCORR in R and Stata) be more suitable? Or is something else more appropriate?
Furthermore: Would I otherwise have to take into account any time effects when running correlations in my setting?
Thank you very much for your advice!
Disclaimer: also posted at Statalist here.

Why is the second parameter of std::assoc_laguerre an unsigned int?

In C++17, a lot of special functions were added to the standard library. One such function is the associated Laguerre polynomials. The second argument requires an unsigned int, but the mathematical definition is valid for real numbers as well. Is there any reason that it's limited to nonnegative integers? Is it simply due to the fact that binomial(n,k) is easier/faster/simpler to calculate when n and k are both positive integers?
Walter E. Brown, the father of special functions in C++, never explicitly answered your Laguerre question as far as I know. Nevertheless, when one reads what Brown did write, a likely motive becomes clear:
Many of the proposed Special Functions have definitions over some or all of the complex plane as well as over some or all of the real numbers. Further, some of these functions can produce complex results, even over real-valued arguments. The present proposal restricts itself by considering only real-valued arguments and (correspondingly) real-valued results.
Our investigation of the alternative led us to realize that the complex landscape for the Special Functions is figuratively dotted with land mines. In coming to our recommendation, we gave weight to the statement from a respected colleague that “Several Ph.D. dissertations would [or could] result from efforts to implement this set of functions over the complex domain.” This led us to take the position that there is insufficient prior art in this area to serve as a basis for standardization, and that such standardization would be therefore premature....
Of course, you are asking about real numbers rather than complex ones, so I cannot prove that the reason is the same, but Abramowitz and Stegun (on whose handbook Brown's proposal was chiefly based) give extra support to some special functions of integer order. Curiously, in chapter 13 of my copy of Abramowitz and Stegun, I see no such extra support, so you have a point, don't you? Maybe Brown was merely being cautious, not wishing to push too much into the C++ standard library all at once, but there does not immediately seem to me to be any obvious reason why floating-point arguments should not have been supported in this case.
Not that I would presume to second-guess Brown.
As you likely know, you can probably use chapter 13 of Abramowitz and Stegun to get the effect you want without too much trouble. Maybe Brown himself would have said just that; to be sure, one would need to ask him.
For information, Brown's proposal, linked earlier, explicitly refers C++'s assoc_laguerre to Abramowitz and Stegun, sect. 13.6.9.
In summary, at the time C++ was first getting special-function support, caution prevailed: the library did not want to advance too far, too fast. It seems fairly probable that this factor chiefly accounts for the limitation you note.
I know for a fact that there was great concern over real and perceived implementability. Also, new library features have to be justified: they have to be useful to more than one community. Integer-order assoc_laguerre is the kind physicists most often use; a signed-integer or real order was probably thought too abstract. As it was, the special math functions barely made it into C++17. I actually think part of the reason they got in was the perception that C++17 was a light release and folks wanted to bolster it.
As we all know, there is no reason for the order parameter to be unsigned, or even integral for that matter. The underlying implementation that I put in libstdc++ (gcc) has a real order alpha. Real-order assoc_laguerre is useful in quadrature rules, for example.
I might recommend proposing overloads for real order to the committee, in addition to relaxing the unsigned integer to signed.

SAS stdi output clarification

I'm attempting to convert some old SAS macros into Python, and am a bit unclear about some of the terminology used in SAS. In the macro, the PROC REG step is:
proc reg data=model_file;
   model &y = &x;
   output out=&outfile r=resid stdi=resid_error;   /* per-observation residual and STDI */
run;
I understand that r means the individual residual per data point, but was unclear what stdi meant. According to the SAS manual, stdi means "standard error of the individual predicted value", so there is one stdi for each row in the dataset. I searched around a bit and found a lecture slide from the University of Wisconsin which I believe explains how to calculate stdi.
However, two (EDIT: one) questions remain:
1. Is the method for calculating the standard error of an individual prediction in the lecture slide indeed correct? I've never seen this method before, so I still have my doubts. I've looked up the SAS manual, but their definition of STDI is a bit confusing. Specifically, h(i) is defined there in terms of [X'X], but I don't know what the bar after [X'X] is supposed to mean.
2. The way the standard error of individual predictions is calculated here utilizes x. However, what happens if you have run a regression with multiple X columns? Does stdi assume only a single X column?
Answer: no. You can have multiple X columns and still get a STDI value.
I'm not a statistician and your question could have included a lot more detail, but a quick Google search suggests that you're looking at a PROC REG. The main documentation for PROC REG is here:
https://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_reg_sect015.htm
and there is a dedicated page for "Model Fit and Diagnostic Statistics" including the relevant formulae here:
https://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_reg_sect039.htm
Maybe that will answer your question. Although these things don't interest me directly, I believe SAS's documentation is generally good about describing the exact computations performed by each procedure.
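For what it's worth, and if I'm reading that "Model Fit and Diagnostic Statistics" page correctly, STDI reduces to the usual standard error of an individual prediction. Writing x_i for the i-th row of the design matrix (so multiple X columns are handled automatically):

h_i = x_i (X'X)^{-1} x_i'
STDI_i = \sqrt{ s^2 (1 + h_i) }

where s^2 is the mean squared error of the fit. The bar-like mark after [X'X] in your rendering of the manual is presumably just the -1 superscript of the matrix inverse.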

Exclude variables in regression based on causality-criteria in SAS

I have done my best to search the web for an answer to my question, but haven't been able to find any. Maybe I'm not asking in the right way, or maybe my problem can't be solved... Well, here goes nothing!
When running a regression in SAS, it is possible to do backward or forward selection and thereby eliminate all insignificant variables, which is great. But just because the p-value of a variable is ≤ 0.05 doesn't necessarily mean that the result is correct.
E.g., I run a regression in SAS with the dependent variable being the number of deaths due to illness and the independent variable being the number of doctors. The result is significant with p ≤ 0.05, but the coefficient says that as the number of doctors rises, the number of deaths also goes up. This would probably be a spurious regression with the causality pointing the wrong way, but SAS is only a computer and doesn't know which way the causality should go. (Of course it could also be true that more doctors = more deaths due to some other factor, but let us ignore that for now.)
My question is: Is it possible to run a regression and tell SAS that it must do backward/forward elimination, but that, according to some rules I set, it also has to exclude variables that don't meet these rules? E.g., if deaths go up as the number of doctors increases, exclude the variable number of doctors? And what would that look like?
I really hope that someone can help me, because I am running regressions for many different years with more than 50 variables, and it would be great if I didn't have to go through all the results myself.
Thanks :)
I don't think this is possible or recommended. As mentioned, SAS is a computer and can't know which regression results are spurious. What if more doctors = more medical procedures = more deaths? Obviously you need to apply expert judgment to each situation, but the above scenario is just as plausible.
You also mention 'share of docs', which isn't the actual number of doctors, if I'm correct? So it may also be an artifact of how this metric is calculated.
If you have a specific set of rules you want to apply for exclusion, that may be possible, but you first have to define all those rules and have a level of certainty regarding them; a sketch of one way to apply such a rule follows.
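To illustrate the machinery even a simple rule needs, here is a rough sketch of a sign rule. The data set and variable names (deaths, mortality, doctors, hospitals, income) are invented, not taken from your data: fit once, check the sign of a coefficient, and flag the variable for exclusion from a refit.

/* initial fit; OUTEST= writes the coefficients as variables in a one-row data set */
proc reg data=deaths outest=est noprint;
   model mortality = doctors hospitals income;
run;
quit;

/* hypothetical rule: flag DOCTORS if its coefficient is positive */
data _null_;
   set est;
   length drop_list $200;
   drop_list = '';
   if doctors > 0 then drop_list = catx(' ', drop_list, 'doctors');
   call symputx('drop_list', drop_list);
run;

%put NOTE: variables flagged by the sign rule: &drop_list;

You would then rebuild the MODEL statement without the flagged variables and refit, typically inside a macro loop.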
If you need to specify unusual parameter selection criteria you could always roll your own machine learning by brute force: partition the data, run different regression models over all the partitions in macro loops, and use something like AIC to select the best model.
However, unless you are a machine learning expert it's probably best to start with something like proc glmselect.
SAS can do both forward selection and backward elimination in the GLMSELECT procedure, e.g.:
proc glmselect data=...;
   model ... / selection=forward;
   ...
It would also be possible to combine both approaches, i.e. run several iterations of proc glmselect in macro loops, each with a different model specification, and then choose the best result; a slightly fuller sketch is below.
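As a rough illustration only (the data set name deaths, the response mortality, and the predictors x1-x50 are invented here), traditional backward elimination by significance level in proc glmselect might look something like this:

proc glmselect data=deaths;
   /* drop predictors one at a time while any has p > 0.05 */
   model mortality = x1-x50 / selection=backward(select=sl sls=0.05);
run;

The SELECT= criterion also accepts information criteria (AIC, SBC, etc.) instead of significance levels, which tends to be the more defensible choice.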

SAS Enterprise Guide, different treatments for missing variables

We are using the ESS data set, but are unsure how to deal with missing values in SAS Enterprise Guide. Our dependent variable is "subjective wellbeing", and we aim to include a large number of control variables; hence, we have a data set containing a lot of missing values.
We do not want to use list-wise deletion. Instead, we would like to treat the different kinds of missingness differently depending on the respondent's response: "no answer", "not applicable", "refusal", "don't know". For example, we plan to conduct pair-wise deletion for "not applicable", while we might want to use e.g. the mean value for some other responses, depending on the question (under the assumption that the type of response provides information about whether the data are MCAR, MAR, or NMAR).
Our main questions are:
Currently, our missing values are marked in different ways in the data set (99, 77, 999, 88, etc.). Should we replace these values in Excel before proceeding in SAS Enterprise Guide? If yes, how should we best replace them, given that they are supposed to be treated in different ways?
How do we tell SAS Enterprise Guide to treat different missings in different ways?
If we use dummy variables to mark refusals for e.g. income, how do we include these in the final regression?
We have tried to read about this but are a bit confused, so we would really appreciate any help :)
On a technical note, SAS offers special missing values: .a, .b, .c, etc. (not case sensitive).
Replace the numeric codes in SAS, e.g. 99 = .a, 77 = .b, as in the sketch below.
Decision trees, for example, will be able to handle these as separate values.
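A minimal sketch of that recode, assuming a hypothetical variable income in which 77, 88, and 99 are the non-response codes (the data set and variable names are invented, not taken from the ESS codebook):

data ess_recode;
   set ess_raw;                            /* hypothetical input data set */
   if income = 99 then income = .a;        /* not applicable */
   else if income = 88 then income = .b;   /* don't know     */
   else if income = 77 then income = .c;   /* refusal        */
run;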
To keep the information of the missing observations in a regression model you will have to make some kind of tradeoff (find the least harmful solution to your problem).
One classical solution is to create dummy variables indicating missingness and replace the missing values with the mean, then include both the dummies and the original variables in the model; a sketch follows at the end of this answer. Possible problems: the coefficients will be biased, multicollinearity, too many categories/variables.
Another approach would be to bin your variables into categories. Do it just by value (e.g. deciles) and you may suffer information loss; do it by theory and you may suffer confirmation bias.
A more advanced approach would be to calculate the information value (http://support.sas.com/resources/papers/proceedings13/095-2013.pdf) of your independent variables, thereby replacing all values including the missings. Of course this will again lead to bias and loss of information, but it might at least be a good step towards identifying useful/useless missing values.
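A rough sketch of that first (dummy plus mean-imputation) approach, again with invented names (ess_recode, income, wellbeing are placeholders), and using proc stdize with METHOD=MEAN REPONLY as just one way to do the mean replacement:

data ess_model;
   set ess_recode;                  /* hypothetical data set with special missings already set */
   income_miss = missing(income);   /* 1 for any kind of missing, 0 otherwise */
run;

proc stdize data=ess_model out=ess_imputed method=mean reponly;
   var income;                      /* replace only the missing income values with the mean */
run;

proc reg data=ess_imputed;
   model wellbeing = income income_miss;   /* original variable plus its missingness dummy */
run;
quit;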