I need to compare (pairwise) observations in N data sets. All data sets have the same table and variable attributes. I only care whether differences in observations are detected, and I need to know which two tables differ and what the specific differences are. What's the most efficient way to do this? I have something below and I'd appreciate comments/suggestions.
I am currently using the following code:
proc compare base=a compare=b
             outnoequal outbase outcomp outdif noprint
             out=a_b_out;
run;
So for every data set a_b_out, I only care whether it is empty; the SAS log produces a NOTE for this.
I am currently using a macro to check whether each a_b_out is empty, and conditionally output it to a permanent library.
Is there a better (more efficient) way?
Sure, that's a reasonable way to do it. Put that in a macro, and have the macro both run the compare and check for the differences.
You can also use the system return code after PROC COMPARE runs to see what kind of difference was reported, if you want to: &SYSINFO. contains a value describing the kinds of differences found - see for example this paper for more information. (Basically, it's a packed integer, with each bit being a flag indicating one kind of difference - the 13th bit, with value 4096, indicates a value difference.)
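A minimal sketch of that pattern (the macro name, its parameters, and the perm output library are my own placeholders, not from the original post):

%macro compare_pair(base=, comp=, outlib=work);
    proc compare base=&base compare=&comp noprint
                 outnoequal outbase outcomp outdif
                 out=_diffs;
    run;
    /* Capture &SYSINFO right after the step, before another step resets it */
    %let rc = &sysinfo;
    /* Bit 13 (value 4096) of the packed return code flags unequal data values */
    %if %sysfunc(band(&rc, 4096)) %then %do;
        data &outlib..&base._&comp._out;
            set _diffs;
        run;
    %end;
%mend compare_pair;

%compare_pair(base=a, comp=b, outlib=perm)

If any difference at all (not just value differences) should trigger the output, testing &SYSINFO for nonzero is enough.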
Consider two variables sex and age, where sex is treated as a categorical variable and age as a continuous variable. I want to specify the full factorial using the non-interacted age term as the baseline (the default for sex##c.age is to omit one of the sex categories).
The best I could come up with so far is to write out the factorial manually (omitting `age' from the regression):
reg y sex#c.age i.sex
This is mathematically equivalent to
reg y sex##c.age
but allows me to directly infer the regression coefficients (and standard errors!) on the interaction term sex*age for both genders.
Is there a way to stick to the more economical "##" notation but make the continuous variable the omitted category?
(I know the manual approach has little notational overhead in the example given here, but the overhead becomes huge when working with triple interactions.)
Not a complete answer, but a workaround: One can use `lincom' to obtain linear combinations of coefficients and standard errors. E.g., one could obtain the combined effects for both sexes as follows:
reg y sex##c.age
forval i = 1/2 {
    lincom age + `i'.sex#c.age
}
I was summarizing a dataset with approximately 1 million rows by 5 categorical variables. I initially tried this with PROC MEANS but kept getting memory warnings, and the process took forever. I ended up just using PROC SQL with a GROUP BY clause, and the whole process took only about 10 seconds. Why is there such a performance difference? My understanding was that PROC MEANS just creates an SQL statement in the background, so I would assume the two methods would perform very similarly.
If you prefer the proc notation, you can also use proc summary instead of proc means or proc sql; it's almost identical to proc means, but for some reason it uses less memory.
I'm not sure why this is the case, but I know I have avoided memory issues in the past by switching code from proc means to proc summary.
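For reference, a sketch of the proc summary version (the data set name have, the class variables cat1-cat5, and the analysis variable x are placeholders). The NWAY option keeps only the full five-way crossing in the output instead of all 2^5 combinations of the class variables, which may also help with memory:

proc summary data=have nway missing;
    class cat1-cat5;                       /* the five category variables */
    var x;                                 /* the analysis variable */
    output out=want mean= sum= / autoname; /* names stats like x_Mean, x_Sum */
run;

The MISSING option makes CLASS treat missing levels as valid groups, matching what GROUP BY does in PROC SQL.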
I am running regressions with fixed effects in Stata, using areg, and I just realized it reports a constant term. As I understand it, what areg does is perform a "within" transformation of the data and then run OLS on the transformed data. However, the constant term of the original model is destroyed by the "within" transformation.
Therefore, what does the constant reported by areg mean? Is it a programming mistake? I don't think so, because areg does not allow the -nocons- option, and it would seem that the reason is connected to the meaning of the constant.
With areg (as well as xtreg with the fe option), the intercept is fit so that y-bar minus the prediction at the means (the intercept plus x-bar times beta-hat) equals zero. In other words, Stata calculates an intercept that makes the prediction calculated at the means of the independent variables equal to the mean of the dependent variable.
As the Stata manual puts it:
The intercept reported by areg deserves some explanation because, given k mutually exclusive and exhaustive dummies, it is arbitrary. areg identifies the model by choosing the intercept that makes the prediction calculated at the means of the independent variables equal to the mean of the dependent variable: ȳ = x̄b̂.
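In symbols (with $\hat{b}$ the slope estimates and bars denoting sample means, as in the quote), the reported intercept is just

$$\hat{\alpha} = \bar{y} - \bar{x}\hat{b},$$

so the fitted regression passes through the point of means $(\bar{x}, \bar{y})$.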
areg does give a different constant than regress does.
areg is somewhat tricky, as it is based on the assumption that the (absorbed) number of groups does not increase with the sample size.
I have a proc discrim step that runs a KNN analysis. When I set k = 1, it assigns everything a category (as expected). But when k > 1, it leaves some observations unassigned (it sets the category to Other).
I'm assuming this is the result of deadlocked votes for two or more of the categories. I know there are ways around this: either take a random one of the deadlocked categories as the answer, or take the nearest of the deadlocked categories as the answer.
Is this functionality available in proc discrim? How do you tell it how to deal with deadlocks?
Cheers!
Your assumption is correct: when the number of nearest neighbors is two or more, an observation is assigned to the "Other" class when two or more of the designated classes tie on the probability of assignment.

You can see this by specifying the OUT=SASdsn option on the PROC DISCRIM statement to write a SAS output data set showing how well the procedure classified the input observations. This data set contains the probability of assignment to each of the designated classes. For example, using two nearest neighbors (K=2) with the iris data set yields five observations that the procedure classifies as ambiguous, each with a probability of 0.50 of being assigned to either the Versicolor or the Virginica class.

From the output data set, you can select these ambiguously classified observations and assign them randomly to these classes in a subsequent DATA step. Alternatively, you can compare the values of the classifying variables for these observations to the class means of those variables, perhaps by calculating a squared distance (standardized by the standard deviation of each variable) and assigning each observation to the "closest" class.
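A minimal sketch of the random tie-breaking approach, using the SASHELP.IRIS data (the data set names scored and resolved and the seed are arbitrary; in general you would look up which classes are tied for each observation rather than hard-coding them, but in this iris example the ties are all between Versicolor and Virginica):

proc discrim data=sashelp.iris method=npar k=2 out=scored noprint;
    class Species;
    var SepalLength SepalWidth PetalLength PetalWidth;
run;

/* _INTO_ holds the assigned class; deadlocked observations come through as 'Other' */
data resolved;
    set scored;
    if _n_ = 1 then call streaminit(12345); /* arbitrary seed, for reproducibility */
    if _into_ = 'Other' then do;
        if rand('uniform') < 0.5 then _into_ = 'Versicolor';
        else _into_ = 'Virginica';
    end;
run;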
This is a theoretical question, so expect that many details here are not computable in practice or even in theory.
Let's say I have a string s that I want to compress. The result should be a self-extracting binary (it could be x86 asm, but also some other hypothetical Turing-complete low-level language) which outputs s.
Now, we can easily enumerate all possible such binaries/programs, ordered by size. Let B_s be the sub-list of those binaries which output s (of course, B_s is uncomputable).
As every non-empty set of positive integers has a minimum, there must be a smallest program b_min_s in B_s.
From s, I can also construct a canonical program b_cano_s which just outputs s in a trivial way, i.e. the size of b_cano_s will be in O(#s) -- if we think of an ELF binary with a data segment, we will even have #b_cano_s ~ #s.
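To state that bound explicitly (the constant c below is the size of the fixed "print the data segment" stub, an assumption about the encoding):

$$\#b^{\mathrm{cano}}_s \le \#s + c, \quad c \text{ independent of } s,$$

which is the standard argument that the size of the smallest program outputting s is at most #s plus a constant.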
Is there a set A of possible operations on the binaries which:
1. Will preserve the output.
2a. Given b_cano_s, we can somehow arrive at b_min_s by operations from A.
(2b. Given b_cano_s, we can arrive at all programs in B_s.)
for all possible strings s.
Conditions 1+2a are weaker than conditions 1+2b. Maybe, though, if there is such a set A, we will automatically get both. (Is that so?)
Does such a set A exist? I was thinking about some obvious operations, like searching for repeated substrings and shortening them, or some of the other common compression methods. However, that is probably not enough to arrive at all programs in B_s, and my intuition says it is also not necessarily enough to arrive at b_min_s, for the same reason.
If it exists, can we express it, i.e. is it computable?
2a. As noted, you cannot compute b_min_s, because that would lead to a paradox: a program that could find b_min_s could be used to output, in only O(log N) bits, the first string whose smallest program exceeds N bits -- a contradiction for large N (the classic uncomputability argument for Kolmogorov complexity). As a result, I don't think you can prove that the operations in A are sufficient to reduce to b_min_s.
2b. You can brute-force B_s, but it is an infinite set, so the procedure is non-terminating. Still, for each program in B_s, you can compute a sequence of manipulations from b_cano_s to that program. However, that does not imply these operations will be meaningful. It seems operations like "delete the characters in this range" and "insert a character at this position" qualify.