I was summarizing a dataset with approximately 1 million rows by 5 categorical variables. I initially tried this with PROC MEANS but kept getting memory warnings, and the process took forever. I ended up using PROC SQL with a GROUP BY clause instead, and the whole thing took around 10 seconds. Why is there such a performance difference? My understanding was that PROC MEANS just creates an SQL statement in the background, so I would have assumed the two methods would perform very similarly.
If you prefer the proc notation, you can also use proc summary instead of proc means or proc sql; it is essentially identical to proc means (the main documented difference is that proc summary prints no output by default), but in my experience it uses less memory.
I'm not sure why that is, but in the past I have avoided memory issues by switching code from proc means to proc summary.
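For what it's worth, here is a sketch of the two approaches side by side (the data set and variable names are invented). The MISSING option makes CLASS keep missing category values, as GROUP BY does, and NWAY restricts the output to the full five-way combination; without it the output data set also contains every lower-order _TYPE_ combination of the class variables:
proc summary data=big nway missing;
    class c1 c2 c3 c4 c5;
    var x;
    output out=summ_means(drop=_type_ _freq_) n=n mean=mean_x sum=sum_x;
run;

proc sql;
    create table summ_sql as
    select c1, c2, c3, c4, c5,
           count(x) as n,
           mean(x)  as mean_x,
           sum(x)   as sum_x
    from big
    group by c1, c2, c3, c4, c5;
quit;
If memory is still tight, one common workaround is to sort once and use a BY statement instead of CLASS, which avoids holding all of the class combinations in memory at the same time:
proc sort data=big;
    by c1 c2 c3 c4 c5;
run;

proc summary data=big;
    by c1 c2 c3 c4 c5;
    var x;
    output out=summ_by(drop=_type_ _freq_) n=n mean=mean_x sum=sum_x;
run;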
Consider two variables, sex and age, where sex is treated as a categorical variable and age as a continuous variable. I want to specify the full factorial model using the non-interacted age term as the baseline (the default for sex##c.age is to omit one of the sex categories).
The best I have come up with so far is to write out the factorial manually (omitting `age' from the regression):
reg y sex#c.age i.sex
This is mathematically equivalent to
reg y sex##c.age
but allows me to read off the regression coefficients (and standard errors!) on the interaction sex#c.age for both sexes directly.
Is there a way to stick with the more economical "##" notation but make the continuous variable the omitted category?
(I know the manual approach adds little notational overhead in the example given here, but the overhead becomes huge when working with triple-interaction terms.)
Not a complete answer, but a workaround: One can use `lincom' to obtain linear combinations of coefficients and standard errors. E.g., one could obtain the combined effects for both sexes as follows:
reg y sex##c.age
forval i = 1/2 {
lincom age + `i'.sex#c.age
}
I need to compare (pairwise) observations in N data sets. All data sets have the same table structure and variable attributes. I only care whether differences in observations are detected, and when they are, I need to know which two tables differ and what the specific differences are. What's the most efficient way to do this? I have something below and I'd appreciate comments/suggestions.
I am currently using the following code:
proc compare base = a compare = b
outnoequal outbase outcomp outdif noprint out = a_b_out;
run;
So for each output data set like a_b_out, I only care whether it is non-empty; the SAS log produces a NOTE for this.
I am currently using a macro to check whether each a_b_out is empty and to conditionally output it to a permanent library.
Is there a better (more efficient) way?
Sure, that's a reasonable way to do it. Put that in a macro, and have the macro both run the compare and check for the differences.
You can also check the return code after PROC COMPARE runs to see what kind of difference was reported, if you want to: the automatic macro variable &SYSINFO. contains a value describing the kinds of differences found - see, for example, this paper for more information. (Basically, it's a packed integer, with each bit acting as a flag for one kind of difference; the 13th bit, with value 4096, indicates a difference in data values.)
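To make that concrete, here is a rough sketch of such a macro (the macro name, parameters, and WORK-library assumption are my own; the key points are to capture &SYSINFO. immediately after PROC COMPARE, before another step resets it, and to test the value-difference bit with the BAND function):
%macro compare_pair(base=, comp=, out=);
    /* assumes &out. is a one-level (WORK) data set name */
    proc compare base=&base. compare=&comp.
        outnoequal outbase outcomp outdif noprint out=&out.;
    run;

    /* &SYSINFO. is reset by the next step, so grab it right away */
    %let rc = &sysinfo.;

    %if &rc. = 0 %then %do;
        /* no differences at all: drop the empty output data set */
        proc datasets lib=work nolist;
            delete &out.;
        quit;
    %end;
    %else %if %sysfunc(band(&rc., 4096)) %then %do;
        /* bit 13 (value 4096) flags unequal data values */
        %put NOTE: &base. vs &comp.: value differences found (SYSINFO=&rc.);
    %end;
%mend compare_pair;

%compare_pair(base=a, comp=b, out=a_b_out);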
According to The Little SAS Book, SAS character data can be up to 2^15 - 1 (32,767) bytes in length.
Where does that one missing character go? In floating-point arithmetic we usually reserve one bit for the sign of the number. Is something similar happening with SAS character data?
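(For reference, a quick check that the limit really is 32,767 bytes; the variable names are arbitrary:)
data _null_;
    length c $32767;            /* 32,767 is accepted; $32768 is rejected */
    c = repeat('x', 32766);     /* REPEAT returns its argument n+1 times  */
    declared = vlength(c);      /* declared (storage) length              */
    used = lengthn(c);          /* length ignoring trailing blanks        */
    put declared= used=;        /* both print 32767                       */
run;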
I don't have a definite answer, but I have a supposition.
I think the limit of 32,767 is not related to per-field overhead: SAS stores all of its rows (in an uncompressed file) in identically sized blocks, so there is no need for a field-length indicator or a null terminator. That is, for a data set created by the following data step:
data want;
length name $8;
input recnum name $ age;
datalines;
01 Johnny 13
02 Nancy 12
03 Rachel 14
04 Madison 12
05 Dennis 15
;;;;
run;
you'd have a layout something like this (the headers are of course not actually written as readable text, but as packed sequences of bytes):
<dataset header>
Dataset name: Want
Dataset record size: 24 bytes
... etc. ...
<subheaders>
Name character type length=8
Recnum numeric type length=8
Age numeric type length=8
... etc. ...
<first row of data follows>
4A6F686E6E792020000000010000000D
4E616E6379202020000000020000000C
52616368656C2020000000030000000E
4D616469736F6E20000000040000000C
44656E6E69732020000000050000000F
<end of data>
The variables run directly into each other, and SAS knows where one stops and the next starts from the information in the subheaders. (This is just the output of a PUT statement with hex formats, of course; in the actual file the numerics are stored as 8-byte floating-point values and, if I remember correctly, come before the character variables, but the idea is the same.)
Technically the .sas7bdat file format is not a publicly disclosed specification, but several people have worked out most of how it works. Some R programmers have written up a specification which, while a bit challenging to read, does give some information.
That write-up indicates that 4 bytes are used to store the field length, which is far more than 32,767 requires (a signed 4-byte integer can hold around 2 billion), so this isn't the definitive answer. I suppose the field may originally have been 2 bytes and was widened to 4 at some later point in SAS's development, though .sas7bdat was a completely new file type created relatively recently (version 7, hence sas7bdat; we're on 9 now).
Another possibility, and perhaps the more likely one, is that before 1999 the ANSI C standard only required compilers to support objects of at least 32,767 bytes - meaning a compiler did not have to support arrays larger than that. While many compilers did support much larger arrays and objects, it's possible that SAS stuck to the guaranteed minimum to avoid issues across different operating systems and hardware. See this discussion of the ANSI C standards for some background. It's also possible that a similar limitation in another language (SAS uses several) is at fault here. [Credit to FriedEgg for the beginning of this idea (offline).]
So, this should be an easy one, but I've always been garbage at contrasts, and the SAS literature isn't really helping. We are running an analysis, and we need to compare different combinations of variables. For example, we have 8 different breeds and 3 treatments, and want to contrast breed 5 against breed 7 at treatment 1. The code I have written is:
proc mixed data=data;
class breed treatment field;
model ear_mass = field breed field*breed treatment field*treatment breed*treatment;
random field*breed*treatment;
estimate "1 C0"
breed 0 0 0 0 1 0 -1 0 breed*treatment 0 0 0 0 1 0 0 0 -1 0 0;
run;
What exactly am I doing wrong in my estimate line that isn't working out?
Your ESTIMATE statement for this particular comparison must also include coefficients for the field*breed interaction.
When defining contrasts, I recommend starting small and building up. Write a contrast for breed 5 at treatment 1 (B5T1), and check its value against its lsmean to confirm that you have the right coefficients. Note that you have to average over all field levels to get this estimate. Likewise, write a contrast for B7T1. Then subtract the coefficients for B5T1 from those for B7T1, noting that the coefficients for some terms (e.g., treatment*field) become all zero.
An easier alternative is to use the LSMESTIMATE statement, which lets you build contrasts from the lsmeans rather than from the model parameters. See the documentation and the paper by Kiernan et al. (2011), "CONTRAST and ESTIMATE Statements Made Easy: The LSMESTIMATE Statement".
Alas, you must tell SAS; it can't tell you.
You are right that it is easy to make an error. It is important to know the ordering of factor levels in the interaction, which is determined by the order of the factors in the CLASS statement. You can confirm the ordering by looking at the order of the interaction lsmeans in the LSMEANS table.
To check, you can compute the estimate of the contrast by hand from the lsmeans. If it matches, you can be confident that the standard error, and hence the inferential test, are also correct.
The LSMESTIMATE statement is a really useful tool: faster and much less error-prone than defining contrasts in terms of model parameters.
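Putting that together, here is a sketch of the LSMESTIMATE approach for the comparison in the question (the nonpositional [coefficient, breed-level treatment-level] terms below are my own illustration; confirm the level ordering against the LSMEANS table as noted above):
proc mixed data=data;
    class breed treatment field;
    model ear_mass = field breed field*breed treatment field*treatment breed*treatment;
    random field*breed*treatment;
    lsmeans breed*treatment;   /* shows the ordering of the interaction lsmeans */
    lsmestimate breed*treatment 'Breed 5 vs Breed 7 at Treatment 1'
        [ 1, 5 1]
        [-1, 7 1];
run;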
I have a PROC DISCRIM step that runs a k-nearest-neighbors analysis. When I set k = 1 it assigns every observation a category (as expected), but when k > 1 it leaves some observations unassigned (the category is set to Other).
I'm assuming this is the result of deadlocked votes between two or more of the categories. I know there are ways around this, either by picking one of the deadlocked categories at random or by taking the nearest of the deadlocked categories as the answer.
Is this functionality available in proc discrim? How do you tell it how to deal with deadlocks?
Cheers!
Your assumption is correct: when the number of nearest neighbors is two or more, an observation is assigned to the "Other" class when it has the same probability of assignment to two or more of the designated classes. You can see this by specifying the OUT=SAS-data-set option in the PROC DISCRIM statement, which writes an output data set showing how the procedure classified the input observations, including the probability of assignment to each of the designated classes. For example, using two nearest neighbors (K=2) with the iris data set yields five observations that the procedure classifies as ambiguous, each with a probability of 0.50 of being assigned to either the Versicolor or the Virginica class.
From that output data set you can select the ambiguously classified observations and assign them randomly to one of the tied classes in a subsequent DATA step. Alternatively, you can compare the values of the classification variables for these observations with the class means of those variables, perhaps by computing a squared distance (optionally standardized by the standard deviation of each variable), and assign each observation to the "closest" class.
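A rough sketch of that workaround using the iris data in sashelp.iris (the posterior-probability variables in the OUT= data set are named after the class levels; the random tie-break below is just one of the options described above, and the data set names are my own):
proc discrim data=sashelp.iris method=npar k=2 out=knn_out noprint;
    class Species;
    var SepalLength SepalWidth PetalLength PetalWidth;
run;

data knn_resolved;
    set knn_out;
    if _n_ = 1 then call streaminit(20150101);   /* reproducible tie-breaking */
    /* a 0.50/0.50 posterior marks a deadlocked vote between two classes;
       break it at random, as suggested above */
    if abs(Versicolor - 0.5) < 1e-6 and abs(Virginica - 0.5) < 1e-6 then do;
        if rand('uniform') < 0.5 then _INTO_ = 'Versicolor';
        else _INTO_ = 'Virginica';
    end;
run;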