How do I cluster questionnaire items together to create one variable, and which analyses should I use when a multivariate linear model gives no output?

I am using SPSS!
I thought I had 2 IVs (motivations and values) and 1 DV (circular use of material)... But then a friend of mine explained to me that I had to break them down. Motivations became 6 types of motivation (from amotivation to intrinsic motivation, so ordered) and values became 4 types of values (no order). Circular use is divided into ten types of circular use (from highest form of circularity to lowest form of circularity). Also, I have a moderating variable, homeownership: 4 types (no order).
All motivations have been measured with a 7-point Likert scale, all values with a 5-point Likert scale, and circularity with a 7-point Likert scale. Homeownership has been measured with 2 dichotomous questions.
I have multiple questions.
For the 6 types of motivations I have 4 or 5 statements that measured each type (26 statements in total). How do I cluster these statements in such a way that I can make 1 variable to use for the analyses? (Same for the values: 4 statements per type of value that I want to combine, to create 4 types of values as 4 separate variables to do the analyses with.)
What analyses should I use then? As I am typing it out right now I have 10 IVs, 10 DVs and 1 moderating variable. And how do I perform them?
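For the first question, the standard approach is to average (or sum) each subscale's statements into one composite score per respondent, after checking internal consistency with Cronbach's alpha (in SPSS: Transform > Compute Variable, and Analyze > Scale > Reliability Analysis). A rough Python sketch of the arithmetic, using made-up responses for one 4-statement motivation subscale:

```python
from statistics import mean, variance

# Hypothetical data: rows = respondents, columns = the 4 statements
# measuring one motivation type on a 7-point Likert scale.
items = [
    [5, 6, 5, 6],
    [3, 3, 4, 3],
    [7, 6, 7, 7],
    [2, 3, 2, 2],
    [4, 5, 4, 4],
]

# Composite score per respondent: the mean of that respondent's item scores.
composites = [mean(row) for row in items]

# Cronbach's alpha as an internal-consistency check before averaging:
# alpha = k/(k-1) * (1 - sum(item variances) / variance of the item sums)
k = len(items[0])
cols = list(zip(*items))
item_vars = [variance(c) for c in cols]
total_var = variance([sum(row) for row in items])
alpha = k / (k - 1) * (1 - sum(item_vars) / total_var)
```

If alpha is acceptably high (a common rule of thumb is above 0.7), the composite can stand in for the subscale in later analyses.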

Related

Implementing a constraint based on previous variable's value in GNU Mathprog/AMPL

I have a binary program and one of my variables, x_it, is defined over two sets: I, the set of objects, and T, the set of weeks of the year. Thus x_it is a binary variable standing for whether object i is assigned to something in week t. The constraint I failed to implement in AMPL/GNU MathProg is: if x_it equals 1, then x_i(t+1) and x_i(t+2) should also take the value 1. Is there a way to implement this constraint in a simple mathematical programming language?
The implication you want to implement is:
x(i,t) = 1 ==> x(i,t+1) = 1, x(i,t+2) = 1
AMPL supports implications (with the ==> operator), so we can write this directly. MathProg does not.
A simple way to implement the implication as straightforward linear inequalities is:
x(i,t+1) >= x(i,t)
x(i,t+2) >= x(i,t)
This can easily be expressed in AMPL, MathProg, or any modeling tool.
This is the pure, naive translation of the question. It means, however, that once a single x(i,t)=1, all following x(i,t+1), x(i,t+2), x(i,t+3), ... = 1. That could have been accomplished by just the constraint x(i,t+1) >= x(i,t).
A better interpretation would be: we don't want very short run lengths. I.e. patterns: 010 and 0110 are not allowed. This is sometimes called a minimum up-time in machine scheduling and can be modeled in different ways.
Forbid the patterns 010 and 0110:
(1-x(i,t-1))+x(i,t)+(1-x(i,t+1)) <= 2
(1-x(i,t-1))+x(i,t)+x(i,t+1)+(1-x(i,t+2)) <= 3
The pattern 01 implies 0111:
x(i,t+1)+x(i,t+2) >= 2*(x(i,t)-x(i,t-1))
Both these approaches will prevent the patterns 010 and 0110 from occurring.
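A quick way to convince yourself that both formulations behave as claimed is to enumerate all binary sequences over a small horizon and check the inequalities directly (a Python sketch; note that runs touching the start or end of the horizon are not constrained, because t-1 or t+2 falls outside T):

```python
from itertools import product

def ok_patterns(x):
    """The two constraints forbidding patterns 010 and 0110 (for all t where the indices exist)."""
    n = len(x)
    for t in range(1, n - 1):
        if (1 - x[t-1]) + x[t] + (1 - x[t+1]) > 2:      # forbids 010
            return False
    for t in range(1, n - 2):
        if (1 - x[t-1]) + x[t] + x[t+1] + (1 - x[t+2]) > 3:  # forbids 0110
            return False
    return True

def ok_implication(x):
    """The '01 implies 0111' constraint: x[t+1] + x[t+2] >= 2*(x[t] - x[t-1])."""
    n = len(x)
    return all(x[t+1] + x[t+2] >= 2 * (x[t] - x[t-1]) for t in range(1, n - 2))
```

Enumerating `product((0, 1), repeat=8)` and filtering with either function confirms that no surviving sequence contains an interior isolated run of ones shorter than 3.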

Comparing variable combinations using contrast or estimate in SAS

So, this should be an easy one, but I've always been garbage at contrasts, and the SAS literature isn't really helping. We are running an analysis, and we need to compare different combinations of variables. For example, we have 8 different breeds and 3 treatments, and want to contrast breed 5 against breed 7 at treatment 1. The code I have written is:
proc mixed data=data;
class breed treatment field;
model ear_mass = field breed field*breed treatment field*treatment breed*treatment;
random field*breed*treatment;
estimate "1 C0"
breed 0 0 0 0 1 0 -1 0 breed*treatment 0 0 0 0 1 0 0 0 -1 0 0;
run;
What exactly am I doing wrong in my estimate line that isn't working out?
Your ESTIMATE statement for this particular comparison must also include coefficients for field*breed.
When defining contrasts, I recommend starting small and building up. Write a contrast for breed 5 at treatment 1 (B5T1), and check its value against its lsmean to confirm that you've got the right coefficients. Note that you have to average over all field levels to get this estimate. Likewise, write a contrast for B7T1. Then subtract the coefficients for B7T1 from those for B5T1, noting that the coefficients for some terms (e.g., treatment*field) are now all zero.
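The coefficient bookkeeping for the subtraction step can be illustrated outside SAS (a Python sketch; it ignores the field terms for brevity, and assumes breed varies slowest in the breed*treatment ordering, which you must verify against your own CLASS statement and LSMEANS table):

```python
n_breed, n_trt = 8, 3  # 8 breeds, 3 treatments, as in the question

def cell_coeffs(breed, trt):
    """Coefficients picking out the cell mean for (breed, trt):
    one slot in the breed main effect, one in breed*treatment."""
    b = [0] * n_breed
    b[breed - 1] = 1
    bt = [0] * (n_breed * n_trt)   # assumed order: breed slowest, treatment fastest
    bt[(breed - 1) * n_trt + (trt - 1)] = 1
    return b + bt

# Contrast "breed 5 vs breed 7 at treatment 1": B5T1 coefficients minus B7T1's.
contrast = [a - b for a, b in zip(cell_coeffs(5, 1), cell_coeffs(7, 1))]
```

The breed part comes out as 0 0 0 0 1 0 -1 0, matching the question, while the breed*treatment part gets its +1 in slot 13 and its -1 in slot 19 (1-based), not in the slots the question's ESTIMATE line used.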
An easier alternative is to use the LSMESTIMATE statement, which allows you to build contrasts using the lsmeans rather than the model parameters. See the documentation and the paper by Kiernan et al. (2011), CONTRAST and ESTIMATE Statements Made Easy: The LSMESTIMATE Statement.
Alas, you must tell SAS, it can't tell you.
You are right, it is easy to make an error. It is important to know the ordering of factor levels in the interaction, which is determined by the order of the factors in the CLASS statement. You can confirm the ordering by looking at the order of the interaction lsmeans in the LSMEANS table.
To check, you can compute the estimate of the contrast by hand using the lsmeans. If it matches, you can be confident that the standard error, and hence the inferential test, are also correct.
The LSMESTIMATE statement is a really useful tool, faster and much less prone to error than defining contrasts in terms of the model parameters.

Could PCA be used to combine multiple rankings?

I have n (in my case just 9) different rankings of the same items. Now I'm trying to find a combination of them using PCA (principal component analysis), in order to improve the accuracy of my ranking. The method should be unsupervised, that is, I'd like to generate the new ranking based only on the input rankings.
My idea is to try all possible subsets (without repetition) of the 9 rankings and run PCA on each of them. I will therefore come out with 501 different new rankings (in the case of n=9). With different subsets I obtain better results.
When I say better results I mean that I have the true ranking, and when I finish the combination I compare the results of all the rankings (combined and the original 9) against it.
Does this method make sense?
Your question involves a subset of voting theory and there are many possibilities for how to solve this. Some of the techniques are more flexible than others. For example, some techniques can accommodate ordered rankings of variable sizes (imagine one ranking only contained 5 ordered items, while the other rankings contained 9 ordered items) while others cannot. Some techniques can assign variable weights to the different reviewers. Netflix has very complex proprietary algorithms it uses to combine multiple users' movie rankings into overall rankings. That being said, your combinatorial PCA approach does not strike me as either computationally efficient or terribly relevant. If you are taking information from only a subset of your 9 rankings, you are potentially discarding useful (although perhaps subtle) information.
Schulze method: Somewhat complex, but widely regarded as one of the best ways to pick a single winner from a set of rankings. Could be applied iteratively or otherwise modified to get an ordered list of winners.
Borda count: Several variations, all of which are quite simple and intuitive and usually lead to reasonable results.
Perhaps the biggest problem with the Borda count is that it does not adequately handle the different standard deviations of two items that may have very similar average rankings. If we constrain ourselves to the subset of problems where all ordered rankings are of the same size and all have equal weight, I can recommend one method that is intuitive and leads to very good results across a wide range of cases: aggregate z-score minimization. This works as follows:
For each one of the ranked items, compute the mean μ and standard deviation σ of its rankings (assume a Gaussian distribution).
Next compute the |z-score| "distance" matrix from every item to every possible ranking position: |z-score| = |proposed ranking position - μ| / σ
Then exhaustively calculate which set of ranking positions give the lowest aggregate (total) z-score distance.
Effectively, the ranking problem is converted into a classification problem where we are trying to classify each ranking position into the best fitting sampled distribution of each item. The constraint is that only one ranking position can be assigned to each Gaussian item distribution. By minimizing the aggregate z-score distance globally, we are finding the most statistically likely configuration for the "true" ranking.
If you don't want to do the programming to exhaustively calculate the combinatorial sums of step 3, there is a heuristic method I'll demonstrate here that usually gives good results (but is not guaranteed to find the best solution).
Suppose we have 4 independent rankings of 6 items (A-F); the first item listed in each ranking is at ranking position #1:
1. A,C,F,E,B,D
2. D,B,C,E,F,A
3. F,A,B,C,D,E
4. E,A,C,B,D,F
Next compute the mean and standard deviation of each item's ranking positions:
A: (#1, #6, #2, #2); μ = 2.75, σ = 2.217
B: μ = 3.5, σ = 1.291
C: μ = 3.0, σ = 0.816
D: μ = 4.25, σ = 2.217
E: μ = 3.75, σ = 2.062
F: μ = 3.75, σ = 2.217
We can see from the relatively tight spread of means (2.75 to 4.25) that all of the items are contending for about the same average, middle positions. This is a case where the Borda count may tend to perform poorly because the standard deviations become extra important when the averages are all so close. So next, we create the matrix of z-score distances from each item to each possible ranking position:
A: 0.7892, 0.3382, 0.1127, 0.5637, 1.0147, 1.4657
B: 1.9365, 1.1619, 0.3873, 0.3873, 1.1619, 1.9365
C: 2.4495, 1.2247, 0.0000, 1.2247, 2.4495, 3.6742
D: 1.4657, 1.0147, 0.5637, 0.1127, 0.3382, 0.7892
E: 1.3339, 0.8489, 0.3638, 0.1213, 0.6063, 1.0914
F: 1.2402, 0.7892, 0.3382, 0.1127, 0.5637, 1.0147
It's probably obvious, but in the event you had any item with σ = 0, you can assign that item to its exclusive ranking position immediately. Now if you don't want to exhaustively solve this matrix for the ranking combination with the lowest possible aggregate z-score assignment, you can use this heuristic. Sum each column, and then subtract the minimum value from that column to get a value we can call "savings":
sum: 9.2151, 5.3777, 1.7658, 2.5225, 6.1344, 9.9718
min: 0.7892, 0.3382, 0.0000, 0.1127, 0.3382, 0.7892
savings: 8.4259, 5.0395, 1.7658, 2.4098, 5.7962, 9.1826
Take the column with the largest "savings" value and assign the item with the min value to that position. In our example here, this means we will assign the item "D" to the 6th position. After doing this, recompute the sum, min, and savings values, but first remove the "D" item's row and also remove the 6th column (because they have already been assigned). Then assign the new largest "savings" value to the item with the min value in that column. Continue until all rankings are assigned. In this example, the final (heuristic) ranking will be as follows: A, E, C, B, F, D (aggregate z-score: 3.3783). I didn't check my work, but it looks like the exhaustively solved solution of A, F, C, B, E, D (aggregate z-score: 3.3612) might be 0.5% better than the heuristic solution.
It's worth noting that the naive solution where we just simply ordered the means A, C, B, E, F, D (aggregate z-score: 3.8754) is substantially less likely (statistically) to be the best ranking.
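The whole procedure (steps 1-3 plus the column-"savings" heuristic) is short enough to sketch in Python. This reproduces the numbers above, including the heuristic ranking A, E, C, B, F, D, and confirms the exhaustive optimum's aggregate z-score of about 3.3612 (A and F turn out to be exactly interchangeable in the first two positions, since they share the same σ and their position costs sum identically):

```python
from itertools import permutations
from statistics import mean, stdev

# The four example rankings from above; the first letter is ranking position #1.
rankings = ["ACFEBD", "DBCEFA", "FABCDE", "EACBDF"]
items = sorted(rankings[0])
n = len(items)

# Step 1: mean and sample standard deviation of each item's positions.
pos = {it: [r.index(it) + 1 for r in rankings] for it in items}
mu = {it: mean(p) for it, p in pos.items()}
sigma = {it: stdev(p) for it, p in pos.items()}

# Step 2: |z-score| distance from each item to each candidate position 1..n.
z = {it: [abs(q - mu[it]) / sigma[it] for q in range(1, n + 1)] for it in items}

def aggregate(order):
    """Total |z| when order[k] occupies ranking position k+1."""
    return sum(z[it][k] for k, it in enumerate(order))

# Step 3, exhaustive: n! is tiny here, so just scan all permutations.
best = min(permutations(items), key=aggregate)

# Heuristic: repeatedly take the position (column) with the largest "savings"
# (column sum minus column minimum) and assign it the item with the minimum.
free_items, free_cols = set(items), set(range(n))
chosen = {}
while free_cols:
    def savings(c):
        col = [z[it][c] for it in free_items]
        return sum(col) - min(col)
    c = max(free_cols, key=savings)
    it = min(free_items, key=lambda i: z[i][c])
    chosen[c] = it
    free_items.discard(it)
    free_cols.discard(c)
heuristic = [chosen[c] for c in sorted(chosen)]
```

The small discrepancies against the hand-computed aggregates (3.3783 and 3.3612) come from the answer's rounding of each z-score to four decimals before summing.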

How do you handle deadlocks using Proc Discrim in SAS for KNN?

I have a proc discrim statement which runs a KNN analysis. When I set k = 1 then it assigns everything a category (as expected). But when k > 1 it leaves some observations unassigned (sets category as Other).
I'm assuming this is a result of deadlock votes for two or more of the categories. I know there are ways around this by either taking a random one of the deadlocked votes as the answer, or taking the nearest of the deadlocked votes as the answer.
Is this functionality available in proc discrim? How do you tell it how to deal with deadlocks?
Cheers!
Your assumption is correct: when the number of nearest neighbors is two or more, an observation is assigned to the "Other" class when two or more of the designated classes tie for the highest probability of assignment. You can see this by specifying the OUT=SASdsn option in the PROC DISCRIM statement to write a SAS output data set showing how well the procedure classified the input observations; this data set contains the probabilities of assignment to each of the designated classes. For example, using two nearest neighbors (K=2) with the iris data set yields five observations that the procedure classifies as ambiguous, each with a probability of 0.50 of being assigned to either the Versicolor or the Virginica class. From the output data set, you can select these ambiguously classified observations and assign them randomly to the tied classes in a subsequent DATA step. Or you can compare the values of the classification variables for each ambiguous observation to the class means, perhaps by calculating a squared distance standardized by the standard deviation of each variable, and assign the observation to the "closest" class.
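The post-processing described above is straightforward once the output data set is in hand. Here is an illustrative Python sketch of the two tie-breaking rules (the record layout, class means, and numbers are invented for illustration, not actual PROC DISCRIM output):

```python
import random

# Made-up class means on two measurement variables.
class_means = {"Versicolor": [5.94, 2.77], "Virginica": [6.59, 2.97]}

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def resolve(row, rule="nearest"):
    """Assign a class; break posterior-probability ties either randomly
    or by distance from the observation to the tied classes' means."""
    probs = {c: row[c] for c in class_means}
    top = max(probs.values())
    tied = [c for c, p in probs.items() if p == top]
    if len(tied) == 1:
        return tied[0]          # no deadlock
    if rule == "random":
        return random.choice(tied)
    return min(tied, key=lambda c: sq_dist(row["x"], class_means[c]))

# A deadlocked observation: 0.50/0.50 posterior probabilities.
deadlocked = {"Versicolor": 0.5, "Virginica": 0.5, "x": [6.1, 2.9]}
```

Under the "nearest" rule the deadlocked observation above goes to whichever tied class has the closer mean vector.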

n-dimensional interpolation c++ algorithm

How can I implement n-dimensional interpolation in C++? In ideal case I would like to have it generic on actual kernel so that I can switch between e.g., linear and polynomial interpolation (perhaps as a start: linear interpolation). This article ( http://pimiddy.wordpress.com/2011/01/20/n-dimensional-interpolation/ ) discusses this stuff but I have two problems:
1) I could not understand how to implement the "interpolate" method shown in the article in C++
2) More importantly I want to use it in a scenario where you have "multiple independent variables (X)" and "1 dependent variable (Y)" and somehow interpolate on both (?)
For example, if n=3 (i.e. 3-dimensional) and I have the following data:
#X1 X2 X3 Y
10 10 10 3.45
10 10 20 4.52
10 20 15 5.75
20 10 15 5.13
....
How could I find the value of Y (the dependent variable) for a particular combination of the X (independent) variables, say 17 17 17?
I know there exist other ways, such as decision trees and SVM, but here I am interested in interpolation.
You can take a look at a set of interpolation algorithms (including C++ implementations) at alglib.
Also it should be noted that neural networks (backpropagation nets, for example) are considered good interpolators.
If your question is about the specific article, it's out of my knowledge.
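One simple scheme for exactly this situation, scattered samples with several independent variables and one dependent variable, is inverse-distance weighting, which works in any number of dimensions. A minimal Python sketch (this is an assumed method for illustration, not the one from the linked article):

```python
def idw(points, values, query, power=2.0):
    """Inverse-distance-weighted estimate of Y at an n-dimensional query point."""
    num = den = 0.0
    for p, y in zip(points, values):
        d2 = sum((a - b) ** 2 for a, b in zip(p, query))
        if d2 == 0.0:
            return y                  # query coincides with a sample: exact value
        w = d2 ** (-power / 2.0)      # weight = 1 / distance**power
        num += w * y
        den += w
    return num / den

# The sample data from the question, queried at (17, 17, 17).
pts = [(10, 10, 10), (10, 10, 20), (10, 20, 15), (20, 10, 15)]
ys = [3.45, 4.52, 5.75, 5.13]
y17 = idw(pts, ys, (17, 17, 17))
```

The estimate is always a convex combination of the sample Y values, so it stays within their range; swapping in a different kernel means replacing the weight function, which is where the article's generic-kernel idea would plug in.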