How could i use 3 random variables instead of one with shape=3 with pymc3 - pymc3

i am new to pymc3. I am trying to implement an unpooled model with 3 random variables (b0,b1,b2 ~ Normal) instead of one (b) with shape= 3. The code is from the book Bayesian Modeling and Computation in Python. I think it will help me implement a more complex multilevel model.
customers = sales_df.loc[:, "customers"].values
sales_observed = sales_df.loc[:, "sales"].values
food_category = pd.Categorical(sales_df["Food_Category"])
with pm.Model() as model_sales_unpooled:
s = pm.HalfNormal("s", 20, shape=3)
b = pm.Normal("b", mu=10, sigma=10, shape=3)
m = pm.Deterministic("m", b[food_category.codes] *customers)
sales = pm.Normal("sales", mu=m, sigma=s[food_category.codes],
observed=sales_observed)

Related

k-fold cross validation: how to filter data based on a randomly generated integer variable in Stata

The following seems obvious, yet it does not behave as I would expect. I want to do k-fold cross validation without using SCC packages, and thought I could just filter my data and run my own regressions on the subsets.
First I generate a variable with a random integer between 1 and 5 (5-fold cross validation), then I loop over each fold number. I want to filter the data by the fold number, but using a boolean filter fails to filter anything. Why?
Bonus: what would be the best way to capture all of the test MSEs and average them? In Python I would just make a list or a numpy array and take the average.
gen randint = floor((6-1)*runiform()+1)
recast int randint
forval b = 1(1)5 {
xtreg c.DepVar /// // training set
c.IndVar1 ///
c.IndVar2 ///
if randint !=`b' ///
, fe vce(cluster uuid)
xtreg c.DepVar /// // test set, needs to be performed with model above, not a
c.IndVar1 /// // new model...
c.IndVar2 ///
if randint ==`b' ///
, fe vce(cluster uuid)
}
EDIT: Test set needs to be performed with model fit to training set. I changed my comment in the code to reflect this.
Ultimately the solution to the filtering issue was I was using a scalar in quotes to define the bounds and I had:
replace randint = floor((`varscalar'-1)*runiform()+1)
instead of just
replace randint = floor((varscalar-1)*runiform()+1)
When and where to use the quotes in Stata is confusing to me. I cannot just use varscalar in a loop, I have to use `=varscalar', but I can for some reason use varscalar - 1 and get the expected result. Interestingly, I cannot use
replace randint = floor((`varscalar')*runiform()+1)
I suppose I should just use
replace randint = floor((`=varscalar')*runiform()+1)
So why is it ok to use the version with the minus one and without the equals sign??
The answer below is still extremely helpful and I learned much from it.
As a matter of fact, two different things are going on here that are not necessarily directly related. 1) How to filter data with a randomly generated integer value and 2) k-fold cross-validation procedure.
For the first one, I will leave an example below that could help you work things out using Stata with some tools that can be easily transferable to other problems (such as matrix generation and manipulation to store the metrics). However, I would call neither your sketch of code nor my example "k-fold cross-validation", mainly because they fit the model, both in the testing and in training data. Nonetheless, the case should be that strictly speaking, the model should be trained in the training data, and using those parameters, assess the performance of the model in testing data.
For further references on the procedure Scikit-learn has done brilliant work explaining it with several visualizations included.
That being said, here is something that could be helpful.
clear all
set seed 4
set obs 100
*Simulate model
gen x1 = rnormal()
gen x2 = rnormal()
gen y = 1 + 0.5 * x1 + 1.5 *x2 + rnormal()
gen byte randint = runiformint(1, 5)
tab randint
/*
randint | Freq. Percent Cum.
------------+-----------------------------------
1 | 17 17.00 17.00
2 | 18 18.00 35.00
3 | 21 21.00 56.00
4 | 19 19.00 75.00
5 | 25 25.00 100.00
------------+-----------------------------------
Total | 100 100.00
*/
// create a matrix to store results
matrix res = J(5,4,.)
matrix colnames res = "R2_fold" "MSE_fold" "R2_hold" "MSE_hold"
matrix rownames res ="1" "2" "3" "4" "5"
// show formated empty matrix
matrix li res
/*
res[5,4]
R2_fold MSE_fold R2_hold MSE_hold
1 . . . .
2 . . . .
3 . . . .
4 . . . .
5 . . . .
*/
// loop over different samples
forvalues b = 1/5 {
// run the model using fold == `b'
qui reg y x1 x2 if randint ==`b'
// save R squared training
matrix res[`b', 1] = e(r2)
// save rmse training
matrix res[`b', 2] = e(rmse)
// run the model using fold != `b'
qui reg y x1 x2 if randint !=`b'
// save R squared training (?)
matrix res[`b', 3] = e(r2)
// save rmse testing (?)
matrix res[`b', 4] = e(rmse)
}
// Show matrix with stored metrics
mat li res
/*
res[5,4]
R2_fold MSE_fold R2_hold MSE_hold
1 .50949187 1.2877728 .74155365 1.0070531
2 .89942838 .71776458 .66401888 1.089422
3 .75542004 1.0870525 .68884359 1.0517139
4 .68140328 1.1103964 .71990589 1.0329239
5 .68816084 1.0017175 .71229925 1.0596865
*/
// some matrix algebra workout to obtain the mean of the metrics
mat U = J(rowsof(res),1,1)
mat sum = U'*res
/* create vector of column (variable) means */
mat mean_res = sum/rowsof(res)
// show the average of the metrics acros the holds
mat li mean_res
/*
mean_res[1,4]
R2_fold MSE_fold R2_hold MSE_hold
c1 .70678088 1.0409408 .70532425 1.0481599
*/

0 DF in regression in SAS enterprise guide

I created dummies in SAS (part of the codes below) and run regression (threw away M23). It was working fine. But then I tried to group them by age since we don't have enough members. I ran it the same way and threw away one age group (M20to24 since this group has the highest membership). Now some of my variables have 0 DF. Does anyone know what went wrong?
I got the message - Note: Model is not full rank. Least-squares solutions for the parameters are not unique. Some statistics will be misleading. A reported DF of 0 or B means that the estimate is biased. The following parameters have been set to 0, since the variables are a linear combination of other variables as shown.
data Table;
set Table;
M0=(AgeGender = '0M');
M1=(AgeGender = '1M');
M2=(AgeGender = '2M');
M3=(AgeGender = '3M');
M4=(AgeGender = '4M');
M5to9=(AgeGender = ' 5to9M');
M10to14=(AgeGender = '10to14M');
M15to19=(AgeGender = '15to19M');
M20to24=(AgeGender = '20to24M');
M25to29=(AgeGender = '25to29M');
M30to34=(AgeGender = '30to34M');
M35to39=(AgeGender = '35to39M');
M40to44=(AgeGender = '40to44M');
M45to49=(AgeGender = '45to49M');
M50to54=(AgeGender = '50to54M');
M55to59=(AgeGender = '55to59M');
M60to64=(AgeGender = '60to64M');
M65Plus=(AgeGender = '65+M');
F0=(AgeGender = '0F');
F1=(AgeGender = '1F');
F2=(AgeGender = '2F');
F3=(AgeGender = '3F');
F4=(AgeGender = '4F');
F5to9=(AgeGender = ' 5to9F');
F10to14=(AgeGender = '10to14F');
F15to19=(AgeGender = '15to19F');
F20to24=(AgeGender = '20to24F');
F25to29=(AgeGender = '25to29F');
F30to34=(AgeGender = '30to34F');
F35to39=(AgeGender = '35to39F');
F40to44=(AgeGender = '40to44F');
F45to49=(AgeGender = '45to49F');
F50to54=(AgeGender = '50to54F');
F55to59=(AgeGender = '55to59F');
F60to64=(AgeGender = '60to64F');
F65Plus=(AgeGender = '65+F');
Dep = (Relationship = 'Dep');
Mandatory = (Mand_Vo = 'Mandatory');
run;
ods output ParameterEstimates=Parameter_Estimates;
proc reg data= Table;
model logPMPM =
M0
M1
M2
M3
M4
M5to9
M10to14
M15to19
M25to29
M30to34
M35to39
M40to44
M45to49
M50to54
M55to59
M60to64
M65Plus
F0
F1
F2
F3
F4
F5to9
F10to14
F15to19
F20to24
F25to29
F30to34
F35to39
F40to44
F45to49
F50to54
F55to59
F60to64
F65Plus;
weight Membership;
run;
ods output close;
It doesn't look like you have overlaps or identical complimentary data variables but that's by definition. Your data is likely having that occur by chance, which is harder to find. You can likely find this by crossing variables that you suspect may be related or doing a pair wise scatter plot (PROC SGSCATTER) and seeing which two overlap almost identically.
You're correct, you wouldn't get this behaviour with continuous values because they're continuous and less likely to overlap exactly. In general, it's considered best practice to NOT categorize/bin variables when you can keep them continuous. The boundaries are artificial, does a 34 year old really differ from that 36 year old? What if all the people in that age group are 34 compared to the 36 in the 35 to 39 age group? You may not find a difference, but if your distribution was everyone at 39 vs everyone at 31 you may find more of a difference. Keeping the data continuous avoids these manufactured issues.

How to build an inflation term structure in QuantLib?

This is what I've got, but I'm getting weird results. Can you spot an error?:
#Zero Coupon Inflation Indexed Swap Data
zciisData = [(ql.Date(18,4,2020), 1.9948999881744385),
(ql.Date(18,4,2021), 1.9567999839782715),
(ql.Date(18,4,2022), 1.9566999673843384),
(ql.Date(18,4,2023), 1.9639999866485596),
(ql.Date(18,4,2024), 2.017400026321411),
(ql.Date(18,4,2025), 2.0074000358581543),
(ql.Date(18,4,2026), 2.0297999382019043),
(ql.Date(18,4,2027), 2.05430006980896),
(ql.Date(18,4,2028), 2.0873000621795654),
(ql.Date(18,4,2029), 2.1166999340057373),
(ql.Date(18,4,2031), 2.152100086212158),
(ql.Date(18,4,2034), 2.18179988861084),
(ql.Date(18,4,2039), 2.190999984741211),
(ql.Date(18,4,2044), 2.2016000747680664),
(ql.Date(18,4,2049), 2.193000078201294)]
def build_inflation_term_structure(calendar, observationDate):
dayCounter = ql.ActualActual()
yTS = build_yield_curve()
lag = 3
fixing_date = calendar.advance(observationDate,-lag, ql.Months)
convention = ql.ModifiedFollowing
cpiTS = ql.RelinkableZeroInflationTermStructureHandle()
inflationIndex = ql.USCPI(False, cpiTS)
#last observed CPI level
fixing_rate = 252.0
baseZeroRate = 1.8
inflationIndex.addFixing(fixing_date, fixing_rate)
observationLag = ql.Period(lag, ql.Months)
zeroSwapHelpers = []
for date,rate in zciisData:
nextZeroSwapHelper = ql.ZeroCouponInflationSwapHelper(rate/100,observationLag,date,calendar,
convention,dayCounter,inflationIndex)
zeroSwapHelpers = zeroSwapHelpers + [nextZeroSwapHelper]
# the derived inflation curve
derived_inflation_curve = ql.PiecewiseZeroInflation(observationDate, calendar, dayCounter, observationLag,
inflationIndex.frequency(), inflationIndex.interpolated(),
baseZeroRate, yTS, zeroSwapHelpers,
1.0e-12, ql.Linear())
cpiTS.linkTo(derived_inflation_curve)
return inflationIndex, derived_inflation_curve, cpiTS, yTS
observation_date = ql.Date(17, 4, 2019)
calendar = ql.UnitedStates()
inflationIndex, derived_inflation_curve, cpiTS, yTS = build_inflation_term_structure(calendar, observation_date)
If I plot the inflationIndex zero rates, I get this:
I've now been checking the same problem and first of all I don't think you need that function ZeroCouponInflationSwapHelper at all.
Just to let you know I've perfectly replicated OpenRiskEngine (ORE)'s inflation curve which itself is built on QuantLib and the idea is quite simple.
compute base value I(0) which is the lagged CPI (this is supposedly your value 252.0),
compute CPI(t)=I(T)=I(0)*(1+quote)^T, where T = ZCIS expiry date, t is T - 3M
I'm not sure where you took I(0) from, but since a lag is present the I(0) is not the latest available CPI value. Instead I(0)=I(2019-04-17) is interpolated value of CPI(2019-01) and CPI(2019-02).
Also to build the CPI curve you don't need any interest rate yield curve at all because the ZCIS pays at maturity T a single cash flow: 'floating' (I(T)/I(0)-1) for 'fixed' (1+quote)^T-1 cash flows. If you equate these, you can back out the 'fair' I(T) which I used above.
Assuming your value I(0)=252.0 is correct, your CPI curve would look like this:
import QuantLib as ql
import pandas as pd
fixing_rate = 252.0
observation_date = ql.Date(17, 4, 2019)
zciisData = [(ql.Date(18,4,2020), 1.9948999881744385),
(ql.Date(18,4,2021), 1.9567999839782715),
(ql.Date(18,4,2022), 1.9566999673843384),
(ql.Date(18,4,2023), 1.9639999866485596),
(ql.Date(18,4,2024), 2.017400026321411),
(ql.Date(18,4,2025), 2.0074000358581543),
(ql.Date(18,4,2026), 2.0297999382019043),
(ql.Date(18,4,2027), 2.05430006980896),
(ql.Date(18,4,2028), 2.0873000621795654),
(ql.Date(18,4,2029), 2.1166999340057373),
(ql.Date(18,4,2031), 2.152100086212158),
(ql.Date(18,4,2034), 2.18179988861084),
(ql.Date(18,4,2039), 2.190999984741211),
(ql.Date(18,4,2044), 2.2016000747680664),
(ql.Date(18,4,2049), 2.193000078201294)]
fixing_dates = []
CPI_computed = []
for tenor, quote in zciisData:
fixing_dates.append(tenor - ql.Period('3M')) # this is 'fixing date' t
pay_date = ql.ActualActual().yearFraction(observation_date, tenor) # this is year fraction of 'pay date' T
CPI_computed.append(fixing_rate * (1+quote/100)**(pay_date))
results = pd.DataFrame({'date': pd.Series(fixing_dates).apply(ql.Date.to_date), 'CPI':CPI_computed})
display(results)
results.set_index('date')['CPI'].plot();

Log base 2 calculation in python

I am trying to calculate the average disorder in ID trees. My code is below:
Republican_yes = yes.count('Republican')
Democrat_yes = yes.count('Democrat')
Republican_no = no.count('Republican')
Democrat_no = no.count('Democrat')
Indep_yes = yes.count('Independent')
Indep_no = no.count('Independent')
disorder_yes= Republican_yes/len(yes)*(math.log(float(Republican_yes)/len(yes),2))+ Democrat_yes/len(yes)*(math.log(float(Democrat_yes)/len(yes),2))+Indep_yes/len(yes)*(math.log(float(Indep_yes)/len(yes),2))
disorder_no= Republican_no/len(no)*(math.log(float(Republican_no)/len(no),2))+Democrat_no/len(no)*(math.log(float(Democrat_no)/len(no),2))+Indep_no/len(no)*(math.log(float(Indep_no)/len(no),2))
avgdisorder = -len(yes)/(len(yes)+len(no))*disorder_yes - len(no)/(len(yes)+len(no))*disorder_no
return avgdisorder
why do I keep getting math domain error?
Check if the lengths are 0 or not, else you will get MathError.
if len(yes):
disorder_yes= Republican_yes/len(yes)*(math.log(float(Republican_yes)/len(yes),2))+ Democrat_yes/len(yes)*(math.log(float(Democrat_yes)/len(yes),2))+Indep_yes/len(yes)*(math.log(float(Indep_yes)/len(yes),2))
if len(no):
disorder_no= Republican_no/len(no)*(math.log(float(Republican_no)/len(no),2))+Democrat_no/len(no)*(math.log(float(Democrat_no)/len(no),2))+Indep_no/len(no)*(math.log(float(Indep_no)/len(no),2))
if len(yes) or len(no):
avgdisorder = -len(yes)/(len(yes)+len(no))*disorder_yes - len(no)/(len(yes)+len(no))*disorder_no
If you want, you can always add the else clause for all 3 if statements as per your requirement.

implementing Bags of Words object recognition using VLFEAT

I am trying to implement a BOW object recognition code in matlab. The process is slightly complicated and I've had a lot of trouble finding proper documentation on the procedure. So could someone double check if my plan below makes sense?
I'm using the VLSIFT library extensively here
Training:
1. Extract SIFT image descriptor with VLSIFT
2. Quantize the descriptors with k-means(vl_hikmeans)
3. Take quantized descriptors and create histogram(VL_HIKMEANSHIST)
4. Create SVM from histograms(VL_PEGASOS?)
I understand step 1-3, but I'm not quite sure if the function for SVM is correct.
VL_PEGASOS takes the following:
W = VL_PEGASOS(X, Y, LAMBDA)
How exactly do I use this function with the histogram that I create?
Finally during the recognition stage, how do I match the image with a class defined by the SVM?
Did you look at their Caltech 101 example code, that is full implementation of an BoW approach.
Here is the part where they classify with pegasos and evaluate the results:
% --------------------------------------------------------------------
% Train SVM
% --------------------------------------------------------------------
lambda = 1 / (conf.svm.C * length(selTrain)) ;
w = [] ;
for ci = 1:length(classes)
perm = randperm(length(selTrain)) ;
fprintf('Training model for class %s\n', classes{ci}) ;
y = 2 * (imageClass(selTrain) == ci) - 1 ;
data = vl_maketrainingset(psix(:,selTrain(perm)), int8(y(perm))) ;
[w(:,ci) b(ci)] = vl_svmpegasos(data, lambda, ...
'MaxIterations', 50/lambda, ...
'BiasMultiplier', conf.svm.biasMultiplier) ;
model.b = conf.svm.biasMultiplier * b ;
model.w = w ;
% --------------------------------------------------------------------
% Test SVM and evaluate
% --------------------------------------------------------------------
% Estimate the class of the test images
scores = model.w' * psix + model.b' * ones(1,size(psix,2)) ;
[drop, imageEstClass] = max(scores, [], 1) ;
% Compute the confusion matrix
idx = sub2ind([length(classes), length(classes)], ...
imageClass(selTest), imageEstClass(selTest)) ;
confus = zeros(length(classes)) ;
confus = vl_binsum(confus, ones(size(idx)), idx) ;