Weighted PCA with FactoMineR shows p.values = 0

I have a problem concerning p.values when doing a PCA with the FactoMineR package in R. The rows are weighted for the PCA (row.w), and in that case the p.values shown are all zero when using the command:
res = dimdesc(res.mca, axes=1:2, proba=0.05)
When I don't use row weights, the p.values all look normal.
What am I missing? Why do I get no meaningful p.values when I use row weights?
Example with row weights:

            correlation       p.value
asyFinanc     0.7561609             0

Without row weights:

            correlation       p.value
asyTransp     0.6899174  1.138453e-21

I got the answer from the author of the FactoMineR package: one needs to divide the weights by their sum (so that they sum to 1) to get correct p-values for a weighted PCA. Presumably, unnormalized weights act like a hugely inflated effective sample size, which drives the p-values down to numerical zero:
library(FactoMineR)
data(decathlon)
# Normalize the row weights so they sum to 1
res = PCA(decathlon[, 1:10], row.w = (1:41) / sum(1:41))
dimdesc(res)

DAX - Reverse margin calculation on a slider slicer

I have a measure that calculates the reverse margin following this principle:
RMS = C / (1 − (MP / 100))
where RMS is the reverse margin sell price (£),
C is the cost of the product (£), and
MP is the margin percentage (%).
The DAX measure itself looks like this:
_rms20% =
VAR newGP = DIVIDE([_cost], (1 - 20/100))
RETURN
    IF([_%currentGP] > 0.2, BLANK(), newGP)
So if the current GP percentage is higher than 20%, a blank is returned; if it is lower, the RMS calculation is returned.
This works nicely, but a problem occurs when I create a What-If parameter slider as follows:
Increase = GENERATESERIES(0, 100, 1)
Increase Value = SELECTEDVALUE('Increase'[Increase])
and use this parameter with the RMS measure:
_slider% = [_rms20%] * (100+'Increase'[Increase Value])/100
For example, if my cost is £27.26 and the desired GP is 20%, then the price has to be increased to £34.08; this is the basic calculation following the principle mentioned above.
If I put this on a slider and increase it by 5 points to 25%, the value shown is £35.78, while in fact it should be £36.34.
I have been trying to fix this for some time now so any advice/recommendation would be very appreciated.
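For reference, a quick numeric check of the two calculations (a plain Python sketch, only to illustrate the arithmetic; the variable names are made up): scaling the finished 20% sell price by 5% is not the same as recomputing the sell price with a 25% margin.

cost = 27.26

rms20 = cost / (1 - 20 / 100)      # 34.075 -> displayed as £34.08
slider = rms20 * (100 + 5) / 100   # 35.78: scales the finished 20% price by 5%
rms25 = cost / (1 - 25 / 100)      # ~36.35: recomputes with a 25% margin
                                   # (the question's £36.34 presumably reflects
                                   # a rounded display of the underlying cost)

print(round(slider, 2), round(rms25, 2))

In other words, the slider value needs to feed into the margin inside the division (MP = 20 + increase) rather than multiply the finished price.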

How do I get two coefficients from a set of regressions plotted on the same chart?

I am estimating a model in Stata 16 over several subsamples. I want a chart comparing two coefficients of interest over the different subsamples, with axis labels showing which subsample they come from.
Is there a way to combine both of these on the same panel, with the mileage estimates in one colour and the trunk space in another?
The closest I can get using coefplot is a tiled plot with a set of coefficients of one variable in one panel, and the coefficients for the other variable in another panel (see toy example below). Any idea how to get both on the same panel?
webuse auto
forval v = 2/5 {
    reg price trunk mpg if rep78==`v'
    est store reg_`v'
}
coefplot reg_2 || reg_3 || reg_4 || reg_5, keep(trunk mpg) bycoefs vertical
There's likely a more elegant way to do this with coefplot, but until someone posts that solution: you can use matrices to brute-force coefplot into behaving the way you'd like. Specifically, define as many matrices as you have unique covariates, with each matrix's dimensions being #specifications × 3. Each row will contain the covariate's estimated coefficient, lower CI bound, and upper CI bound for a particular model specification.
This works because coefplot assigns the same color to all quantities associated with plot (as defined by coefplot's help file). plot is usually a stored model from estimates store, but by using the matrix trick, we've shifted plot to be equivalent to a specific covariate, giving us the same color for a covariate across all the model specifications. coefplot then looks to the matrix's rows to find its "categorical" information for the labeled axis. In this case, our matrix's rows correspond to a stored model, giving us the specification for our axis labels.
// (With macros for the specification names + # of coefficient
// matrices, for generalizability)
clear *
webuse auto

// Declare model's covariates
local covariates trunk mpg

// Estimate the various model specifs
local specNm = ""  // holder for gph axis labels
forval v = 2/5 {
    // Estimate the model
    reg price `covariates' if rep78==`v'

    // Store specification's name, for gph axis labels
    local specNm = "`specNm' reg_`v'"

    // For each covariate, pull its coefficient + CIs for this model, then
    // append that row vector to a new matrix containing that covariate's
    // b + CIs across all specifications
    matrix temp = r(table)
    foreach x of local covariates {
        matrix `x' = nullmat(`x') \ (temp["b","`x'"], temp["ll","`x'"], temp["ul","`x'"])
    }
}

// Store the list of 'new' covariate matrices, along w/the
// column within each matrix containing the coefficients
global coefGphList = ""
foreach x of local covariates {
    matrix rownames `x' = `specNm'
    global coefGphList = "$coefGphList matrix(`x'[,1])"
}

// Plot
coefplot $coefGphList, ci((2 3)) vertical

Circular dependency while calculating column

I have a single table of data named RDSLPDSL. I am trying to calculate two columns based on two measures I am creating from the table.
Count of RDSL Marker for 1 =
CALCULATE(
    COUNT('RDSLPDSL'[RDSL Marker]),
    'RDSLPDSL'[RDSL Marker] IN { 1 }
)
I am using the above code as a measure to count only the rows with a value of 1 in the RDSL Marker column.
RDSL % = DIVIDE([Count of RDSL Marker for 1], COUNTROWS(RDSLPDSL))
Then I created a column using the above code to divide the count of rows with 1 by the total number of rows in the table.
I am doing the same for another column with PDSL. It is as follows:
Count of PDSL Marker for 1 =
CALCULATE(
    COUNT('RDSLPDSL'[PDSL Marker]),
    'RDSLPDSL'[PDSL Marker] IN { 1 }
)
PDSL % = DIVIDE([Count of PDSL Marker for 1], COUNTROWS(RDSLPDSL))
But when I do this calculation, I get a "circular dependency detected" error and no final output, even though the same code worked for the previous column.
I tried COUNTAX directly instead of using CALCULATE, but that brings up the same error too.
I also tried using measures instead of calculated columns, which seems to remove the error, but the output is not what I expect and is incorrect.
Any help would be highly appreciated.

Divide the testing set into subgroups, then make predictions on each subgroup separately

I have a dataset similar to the following table:
The prediction target is going to be the 'score' column. I'm wondering how I can divide the testing set into different subgroups, such as scores between 1 and 3, and then check the accuracy on each subgroup.
Now what I have is as follows:
from sklearn import metrics, tree
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = tree.DecisionTreeRegressor()
model.fit(X_train, y_train)
for i in (0, 1, 2, 3, 4):
    y_new = y_test[(y_test >= i) & (y_test <= i + 1)]
    y_new_pred = model.predict(X_test)  # predicts for ALL test rows
    print(metrics.r2_score(y_new, y_new_pred))
However, my code did not work and this is the traceback that I get:
ValueError: Found input variables with inconsistent numbers of samples: [14279, 55955]
I have tried the solution provided below, but it looks like for the full score range (0-5) the r^2 is 0.67, while for the subscore ranges (0-1, 1-2, 2-3, 3-4, 4-5) the r^2 values are significantly lower than that of the full range. Shouldn't some of the subscore r^2 values be higher than 0.67 and some of them lower?
Could anyone kindly let me know where I went wrong? Thanks a lot for your help.
When you are computing the metrics, you have to filter the predicted values as well (based on your subset condition).
Basically you are trying to compute
metrics.r2_score([1,3],[1,2,3,4,5])
which creates an error,
ValueError: Found input variables with inconsistent numbers of samples: [2, 5]
Hence, my suggested solution would be
model.fit(X_train, y_train)

# Compute the predictions only once
y_pred = model.predict(X_test)

for i in (0, 1, 2, 3, 4):
    # Compute the condition for the subset here
    subset = (y_test >= i) & (y_test <= i + 1)
    # Filter BOTH the true and the predicted values with the same mask
    print(metrics.r2_score(y_test[subset], y_pred[subset]))
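As for the follow-up about the subscore r^2 values: that behaviour is expected. R^2 compares the residual variance to the variance of the true values within the evaluated set, so restricting y to a narrow band shrinks the denominator while the errors stay roughly the same size, and the subgroup scores will typically all come out below the full-range score. A minimal sketch of the effect (hypothetical numbers, constant prediction error):

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([0.2, 1.4, 2.1, 3.3, 4.8])
y_pred = y_true + 0.3  # every prediction is off by the same 0.3

# Full range: the spread of y_true is large relative to the error
print(r2_score(y_true, y_pred))          # ~0.96

# Narrow band: same error, much smaller spread of y_true
print(r2_score(y_true[:2], y_pred[:2]))  # 0.75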

Determining log_perplexity using LdaMulticore for the optimum number of topics

I am trying to determine the optimum number of topics for my LDA model using log perplexity in Python. That is, I am graphing the log perplexity for a range of topic counts and determining the minimum. However, the graph I obtain has negative values for log perplexity, when I expected positive values between 0 and 1.
# Calculating the log perplexity per word as obtained by gensim code
# https://radimrehurek.com/gensim/models/atmodel.html
# Parameters: pass in trained corpus
# Returns: graph of perplexity per word for varying numbers of topics
import matplotlib.pyplot as plt
import pandas as pd
from gensim import models

parameter_list = range(1, 500, 100)
grid = {}
for parameter_value in parameter_list:
    model = models.LdaMulticore(corpus=corpus, workers=None, id2word=None,
                                num_topics=parameter_value, iterations=10)
    grid[parameter_value] = []
    perplex = model.log_perplexity(corpus, total_docs=len(corpus))
    grid[parameter_value].append(perplex)

df = pd.DataFrame(grid)
ax = plt.figure(figsize=(7, 4), dpi=300).add_subplot(111)
df.iloc[0].transpose().plot(ax=ax, color="#254F09")
plt.xlim(parameter_list[0], parameter_list[-1])
plt.ylabel('Log perplexity')
plt.xlabel('Number of topics')
plt.show()
What you are computing is the log perplexity, not the perplexity itself. The quantity between 0 and 1 is the per-word likelihood it is based on, and the logarithm of a number in the (0, 1) range is below zero, so the negative values are expected.
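If you would rather plot a positive quantity, you can convert the value back to a per-word perplexity; a minimal sketch, assuming gensim's documented convention that log_perplexity returns a base-2 per-word likelihood bound:

import numpy as np

# Hypothetical value as returned by model.log_perplexity(corpus):
# a per-word likelihood bound in log base 2 (hence negative)
log_perplexity = -7.5

# Perplexity is 2 raised to the negative bound, so it is always >= 1;
# lower values still indicate a better model
perplexity = np.power(2.0, -log_perplexity)
print(perplexity)  # ~181.0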