I am trying to implement bag-of-words (BoW) object recognition in MATLAB. The process is somewhat complicated and I've had a lot of trouble finding proper documentation on the procedure, so could someone double-check whether my plan below makes sense?
I'm using the VLFeat library extensively here.
1. Extract SIFT image descriptors with VL_SIFT
2. Quantize the descriptors with k-means (VL_HIKMEANS)
3. Take the quantized descriptors and create a histogram (VL_HIKMEANSHIST)
4. Train an SVM on the histograms (VL_PEGASOS?)
I understand steps 1-3, but I'm not sure whether the function for the SVM is correct.
VL_PEGASOS takes the following:
How exactly do I use this function with the histogram that I create?
Finally during the recognition stage, how do I match the image with a class defined by the SVM?

Have you looked at their Caltech-101 example code? It is a full implementation of a BoW approach.
Here is the part where they classify with PEGASOS and evaluate the results:
% --------------------------------------------------------------------
% Train SVM
% --------------------------------------------------------------------
lambda = 1 / (conf.svm.C * length(selTrain)) ;
w = [] ;
for ci = 1:length(classes)
  perm = randperm(length(selTrain)) ;
  fprintf('Training model for class %s\n', classes{ci}) ;
  % one-vs-rest labels: +1 for class ci, -1 for every other class
  y = 2 * (imageClass(selTrain) == ci) - 1 ;
  data = vl_maketrainingset(psix(:,selTrain(perm)), int8(y(perm))) ;
  [w(:,ci) b(ci)] = vl_svmpegasos(data, lambda, ...
                                  'MaxIterations', 50/lambda, ...
                                  'BiasMultiplier', conf.svm.biasMultiplier) ;
end
model.b = conf.svm.biasMultiplier * b ;
model.w = w ;
% --------------------------------------------------------------------
% Test SVM and evaluate
% --------------------------------------------------------------------
% Estimate the class of the test images
scores = model.w' * psix + model.b' * ones(1,size(psix,2)) ;
[drop, imageEstClass] = max(scores, [], 1) ;
% Compute the confusion matrix
idx = sub2ind([length(classes), length(classes)], ...
imageClass(selTest), imageEstClass(selTest)) ;
confus = zeros(length(classes)) ;
confus = vl_binsum(confus, ones(size(idx)), idx) ;
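To make the recognition step concrete, here is a minimal sketch in plain Python (not the VLFeat API) of what the scoring lines above do: each class has its own linear classifier, an image's histogram is scored against all of them, and the class with the highest score wins. The names predict_class and confusion_matrix, and the toy weights and histograms, are hypothetical and only for illustration.

```python
def predict_class(W, b, x):
    """Return the index of the highest-scoring one-vs-rest linear classifier.

    W is a list of per-class weight vectors, b a list of per-class biases,
    and x a single feature vector (e.g. a bag-of-words histogram).
    """
    scores = [sum(wi * xi for wi, xi in zip(w, x)) + bias
              for w, bias in zip(W, b)]
    return max(range(len(scores)), key=scores.__getitem__)


def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Count how often true class r (row) was predicted as class c (column)."""
    confus = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        confus[t][p] += 1
    return confus


# Hypothetical toy model: 3 classes, 2-bin histograms.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.5]
histograms = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.1]]
predicted = [predict_class(W, b, h) for h in histograms]
print(predicted)
```

Classes are zero-based here, whereas the MATLAB code uses one-based indices.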


The following seems obvious, yet it does not behave as I would expect. I want to do k-fold cross-validation without using SSC packages, and thought I could just filter my data and run my own regressions on the subsets.
First I generate a variable with a random integer between 1 and 5 (5-fold cross validation), then I loop over each fold number. I want to filter the data by the fold number, but using a boolean filter fails to filter anything. Why?
Bonus: what would be the best way to capture all of the test MSEs and average them? In Python I would just make a list or a numpy array and take the average.
gen randint = floor((6-1)*runiform()+1)
recast int randint
forval b = 1(1)5 {
    xtreg c.DepVar              /// training set
        c.IndVar1               ///
        c.IndVar2               ///
        if randint != `b'       ///
        , fe vce(cluster uuid)
    xtreg c.DepVar              /// test set, needs to be performed with model above, not a
        c.IndVar1               /// new model...
        c.IndVar2               ///
        if randint == `b'       ///
        , fe vce(cluster uuid)
}
EDIT: Test set needs to be performed with model fit to training set. I changed my comment in the code to reflect this.
Ultimately, the filtering problem came from referring to a scalar inside macro quotes when defining the bounds. I had:
replace randint = floor((`varscalar'-1)*runiform()+1)
instead of just
replace randint = floor((varscalar-1)*runiform()+1)
When and where to use the quotes in Stata is confusing to me. I cannot just use varscalar in a loop, I have to use `=varscalar', but I can for some reason use varscalar - 1 and get the expected result. Interestingly, I cannot use
replace randint = floor((`varscalar')*runiform()+1)
I suppose I should just use
replace randint = floor((`=varscalar')*runiform()+1)
So why is it ok to use the version with the minus one and without the equals sign??
The answer below is still extremely helpful and I learned much from it.
As a matter of fact, two different things are going on here that are not necessarily related: 1) how to filter data with a randomly generated integer value, and 2) the k-fold cross-validation procedure.
For the first one, I will leave an example below that could help you work things out in Stata, using tools that transfer easily to other problems (such as generating and manipulating matrices to store the metrics). However, I would call neither your sketch of code nor my example "k-fold cross-validation", mainly because both fit a separate model on the training data and on the testing data. Strictly speaking, the model should be fit on the training data only and then, using those estimated parameters, its performance should be assessed on the testing data.
For further reference on the procedure, Scikit-learn has done brilliant work explaining it, with several visualizations included.
That being said, here is something that could be helpful.
clear all
set seed 4
set obs 100
*Simulate model
gen x1 = rnormal()
gen x2 = rnormal()
gen y = 1 + 0.5 * x1 + 1.5 *x2 + rnormal()
gen byte randint = runiformint(1, 5)
tab randint
    randint |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         17       17.00       17.00
          2 |         18       18.00       35.00
          3 |         21       21.00       56.00
          4 |         19       19.00       75.00
          5 |         25       25.00      100.00
------------+-----------------------------------
      Total |        100      100.00
// create a matrix to store results
matrix res = J(5,4,.)
matrix colnames res = "R2_fold" "MSE_fold" "R2_hold" "MSE_hold"
matrix rownames res ="1" "2" "3" "4" "5"
// show formatted empty matrix
matrix li res
     R2_fold   MSE_fold    R2_hold   MSE_hold
1          .          .          .          .
2          .          .          .          .
3          .          .          .          .
4          .          .          .          .
5          .          .          .          .
// loop over different samples
forvalues b = 1/5 {
    // run the model using fold == `b'
    qui reg y x1 x2 if randint == `b'
    // save R squared training
    matrix res[`b', 1] = e(r2)
    // save rmse training (note: e(rmse) is the root MSE, not the MSE)
    matrix res[`b', 2] = e(rmse)
    // run the model using fold != `b'
    qui reg y x1 x2 if randint != `b'
    // save R squared training (?)
    matrix res[`b', 3] = e(r2)
    // save rmse testing (?)
    matrix res[`b', 4] = e(rmse)
}
// Show matrix with stored metrics
mat li res
     R2_fold   MSE_fold    R2_hold   MSE_hold
1  .50949187  1.2877728  .74155365  1.0070531
2  .89942838  .71776458  .66401888   1.089422
3  .75542004  1.0870525  .68884359  1.0517139
4  .68140328  1.1103964  .71990589  1.0329239
5  .68816084  1.0017175  .71229925  1.0596865
// some matrix algebra workout to obtain the mean of the metrics
mat U = J(rowsof(res),1,1)
mat sum = U'*res
/* create vector of column (variable) means */
mat mean_res = sum/rowsof(res)
// show the average of the metrics across the folds
mat li mean_res
      R2_fold   MSE_fold    R2_hold   MSE_hold
c1  .70678088  1.0409408  .70532425  1.0481599
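For comparison, here is a minimal sketch in Python of the stricter procedure described above, where the model is fit on the training folds only and then scored on the held-out fold. It answers the "list and average" part of the question in that language. The helper names fit_line and kfold_mse, the single-predictor model and the simulated data are assumptions for illustration, not a translation of the Stata code.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def kfold_mse(xs, ys, folds, k):
    """Fit on all folds except one, then score the held-out fold; repeat k times."""
    mses = []
    for f in range(1, k + 1):
        train = [i for i, g in enumerate(folds) if g != f]
        test = [i for i, g in enumerate(folds) if g == f]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - (a + b * xs[i])) ** 2 for i in test]
        mses.append(sum(errors) / len(errors))
    return mses


random.seed(4)
n, k = 100, 5
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [1 + 0.5 * x + random.gauss(0, 1) for x in xs]
folds = [i % k + 1 for i in range(n)]   # balanced fold labels 1..k
random.shuffle(folds)
mses = kfold_mse(xs, ys, folds, k)
print(sum(mses) / k)  # average test MSE across the k folds
```

The key difference from the loop above is that the coefficients estimated on the training folds are reused to predict the held-out fold, instead of running a second regression on it.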

I need help with looping in Mata. I have to write code that computes the OLS beta coefficients in Mata using a loop. I am not sure how to refer to the variables or how to structure the code. Here is what I have so far.
foreach j of local X {
if { //for X'X
matrix XX = [mata:XX = cross(X,1 , X,1)]
else {
mata:Xy = cross(X,1 , y,0)
I am getting an error message "invalid syntax".
I'm not sure what you need the loop for. Perhaps you can provide more information about that. However the following example may help you implement OLS in mata.
Load example data from bcuse:
ssc install bcuse
bcuse bwght
mata
// build the design matrix: regressors plus a constant column
x = st_data(., ("male", "parity","lfaminc","packs"))
cons = J(rows(x), 1, 1)
X = (x, cons)
y = st_data(., ("lbwght"))
// OLS coefficients, residuals, and residual variance
beta_hat = (invsym(X'*X))*(X'*y)
e_hat = y - X * beta_hat
s2 = (1 / (rows(X) - cols(X))) * (e_hat' * e_hat)
// heteroskedasticity-robust variance: accumulate the middle term row by row
B = J(cols(X), cols(X), 0)
n = rows(X)
for (i=1; i<=n; i++) {
    B = B + (e_hat[i,1]*X[i,.])'*(e_hat[i,1]*X[i,.])
}
V_robust = (n/(n-cols(X)))*invsym(X'*X)*B*invsym(X'*X)
se_robust = sqrt(diagonal(V_robust))
V_ols = s2 * invsym(X'*X)
se_ols = sqrt(diagonal(V_ols))
end
This is far from the only way to implement OLS using Mata. See the Stata Blog for another example using quadcross. I like my example because it preserves a little more of the matrix algebra in the code.
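As a language-neutral illustration of the same algebra, the sketch below solves the normal equations (X'X) beta = X'y in plain Python. The function name ols and the toy data are hypothetical. It uses Gauss-Jordan elimination rather than an explicit inverse, so it is a swapped-in technique and not what invsym does.

```python
def ols(X, y):
    """Solve the normal equations (X'X) beta = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


# Hypothetical data generated exactly by y = 2*x + 1 (last column is the constant).
X = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
y = [1.0, 3.0, 5.0, 7.0]
beta_hat = ols(X, y)
residuals = [yi - sum(b * x for b, x in zip(beta_hat, row)) for row, yi in zip(X, y)]
print(beta_hat)
```

The only loops needed are the ones inside the linear solve, which is why an explicit loop over variables is not required for the coefficients themselves.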

I use rolling regression in R quite a lot and my initial setup is something like:
dolm <- function(x) coef(lm(x[,1] ~ x[,2] + 0, data =
rollingCoef = rollapply(someData, 100, dolm)
The above example works perfectly, except that it is slow if you have a lot of iterations.
To speed it up I decided to experiment with the Rcpp package.
First I substituted lm with fastLm; the result is a bit faster but still slow. That pushed me to attempt to write the entire rolling-regression coefficient function in C++ as a for loop and then integrate it into R with Rcpp's help.
So I changed RcppArmadillo's original fastLm function to this:
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;

// [[Rcpp::export]]
List rollCoef(const arma::mat& X, const arma::colvec& y, double window) {
    double cppWindow = window - 1;
    double matRows = X.n_rows;
    double matCols = X.n_cols - 1;
    arma::mat coef(matRows - cppWindow, X.n_cols); // matrix for estimated coefficients
    // for loop for rolling regression
    for (double i = 0; i < matRows - cppWindow; i++) {
        coef.row(i) = arma::trans(arma::solve(X(arma::span(i, i + cppWindow), arma::span(0, matCols)), y.rows(i, i + cppWindow)));
    }
    return List::create(_["coefficients"] = coef);
}
and then load it into R with sourceCpp(file=".../rollCoef.cpp").
It is much faster than rollapply and worked fine on small examples, but when I applied it to ~200000 observations it produced NAs in roughly half of the output, whereas the rollapply/fastLm combination didn't produce any.
So here I need some help. What is wrong with my function? Why are there NAs in my function's output but none from rollapply/fastLm if, as I understand it, both are based on arma::solve? Any help is highly appreciated.
Here is reproducible code:
myData <- source_DropboxData(file = "example.csv",
key = "cbrmkkbssu5bn96", sep = ",", header = TRUE)
## In order to use my custom function "rollCoef" you need to load it into R.
## The C++ code is presented above in the main question.
## Save it wherever you want as "rollCoef.cpp" and then load it into R with:
sourceCpp(file=".../rollCoef.cpp") # replace with your actual path
myCoef = rollCoef(as.matrix(myData[,2]),myData[,1],260)
summary(unlist(myCoef)) # 80923 NA's
dolm = function(x) coef(fastLmPure(as.matrix(x[,2]), x[,1]))
myCoef2 = rollapply(myData, 260, dolm, by.column = FALSE)
summary(myCoef2) # 80923 NA's
dolm2 = function(x) coef(fastLm(x[,1] ~ x[,2] + 0, data =
myCoef3 = rollapply(myData, 260, dolm2, by.column = FALSE)
summary(myCoef3) # !!! No NA's !!!
head(unlist(myCoef)) ; head(unlist(myCoef2)) ; head(myCoef3)
So the output of my function is identical to the output of RcppArmadillo's fastLmPure combined with rollapply, and both produce NAs, but rollapply with fastLm does not. As I understand it, for example from HERE and HERE, fastLm basically calls fastLmPure, so why are there no NAs in the third method? Are there additional capabilities in fastLm that prevent NAs and that I haven't spotted?
There is an entire package, RcppRoll, for exactly this kind of custom rolling computation, and you should be able to extend it and its rollit() function to do a rolling lm() as well.
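One possible explanation for the NA discrepancy, offered as an assumption that the thread does not confirm: if the data contain missing values, a formula interface typically drops incomplete rows before fitting, while a raw matrix solve propagates them into every window that touches one. The Python sketch below contrasts the two behaviours for a rolling no-intercept slope, sum(x*y) / sum(x*x). The function name rolling_slope and the toy data are hypothetical.

```python
import math


def rolling_slope(x, y, window, skip_missing=True):
    """Rolling no-intercept OLS slope, sum(x*y) / sum(x*x), over each full window."""
    out = []
    for start in range(len(x) - window + 1):
        pairs = list(zip(x[start:start + window], y[start:start + window]))
        if skip_missing:
            # Drop pairs with a missing value, as listwise deletion would.
            pairs = [(a, b) for a, b in pairs
                     if not (math.isnan(a) or math.isnan(b))]
        sxx = sum(a * a for a, _ in pairs)
        sxy = sum(a * b for a, b in pairs)
        out.append(sxy / sxx if sxx > 0 else float("nan"))
    return out


nan = float("nan")
x = [1.0, 2.0, 3.0, nan, 5.0, 6.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
naive = rolling_slope(x, y, 3, skip_missing=False)   # windows touching the NaN give nan
cleaned = rolling_slope(x, y, 3, skip_missing=True)  # every window gives 2.0
print(naive)
print(cleaned)
```

Checking the input for missing values (for example with anyNA in R) would be a quick way to test whether this assumption applies to the example data.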

How can I conduct a hypothesis test in Stata when my predictor perfectly predicts my dependent variable?
I would like to run the same regression over many subsets of my data. For each regression, I would then like to test the hypothesis that beta_1 = 1/2. However, for some subsets, I have perfect collinearity, and Stata is not able to calculate standard errors.
For example, in the below case,
sysuse auto, clear
gen value = 2*foreign*(price<6165)
gen value2 = 2*foreign*(price>6165)
gen id = 1 + (price<6165)
I get the output
. reg foreign value value2 weight length, noconstant
Source | SS df MS Number of obs = 74
-------------+------------------------------ F( 4, 70) = .
Model | 22 4 5.5 Prob > F = .
Residual | 0 70 0 R-squared = 1.0000
-------------+------------------------------ Adj R-squared = 1.0000
Total | 22 74 .297297297 Root MSE = 0
foreign | Coef. Std. Err. t P>|t| [95% Conf. Interval]
value | .5 . . . . .
value2 | .5 . . . . .
weight | 3.54e-19 . . . . .
length | -6.31e-18 . . . . .
. test value = .5
( 1) value = .5
F( 1, 70) = .
Prob > F = .
In the actual data, there is usually more variation. So I can identify the cases where the predictor does a very good job of predicting the DV--but I miss those cases where prediction is perfect. Is there a way to conduct a hypothesis test that catches these cases?
The end goal would be to classify observations within subsets based on the hypothesis test. If I cannot reject the hypothesis at the 95% confidence level, I classify the observation as type 1. Below, both groups would be classified as type 1, though I only want the second group.
gen type = .
forvalues i = 1/2 {
    quietly: reg foreign value value2 weight length if id == `i', noconstant
    test value = .5
    replace type = 1 if r(p) > .05
}
There is no way to do this out of the box that I'm aware of. Of course you could program it yourself to get an approximation of the p-value in these cases. The standard error is missing here because the relationship between x and y is perfectly collinear. There is no noise in the model, nothing deviates.
Interestingly enough though, the standard error of the estimate is useless in this case anyway. test performs a Wald test for beta_i = exp against beta_i != exp, not a t-test.
The Wald test uses the variance-covariance matrix from the regression. To see this yourself, refer to the Methods and formulas section here and run the following code:
(also, if you remove the -1 from gen mpg2 = and run, you will see the issue)
sysuse auto, clear
gen mpg2 = mpg * 2.5 - 1
qui reg mpg2 mpg, nocons
* collect matrices to calculate Wald statistic
mat b = e(b) // Vector of Coefficients
mat V = e(V) // Var-Cov matrix
mat R = (1) // for use in Rb-r. This does not == [0,1] because of
            // the use of the noconstant option in regress
mat r = (2.5) // Value you want to test for equality
mat W = (R*b-r)'*inv(R*V*R')*(R*b-r)
// This is where it breaks for you, because with perfect collinearity, V == 0
reg mpg2 mpg, nocons
test mpg = 2.5
sca F = r(F)
sca list F
mat list W
Now, as @Brendan Cox suggested, you might be able to simply use the missing value returned in r(p) to condition your replace command, depending on exactly how you are using it. A word of caution, however: when the relationship between some x and y is such that y = 2x, and you run test x = 5 versus test x = 2, you will need to be very careful about how you interpret missing p-values. In both cases the observations are classified as type == 1, even though only the hypothesis x = 2 is consistent with the data.
Another work-around would be to simply set p = 0 in these cases, since the variance estimate will asymptotically approach 0 as the linear relationship becomes near perfect, and thus the Wald statistic will approach infinity (driving p down, all else equal).
A final yet more complicated work-around in this case could be to calculate the F-statistic manually using the formula in the manual, and setting V to some arbitrary, yet infinitesimally small number. I've included code to do this below, but it is quite a bit more involved than simply issuing the test command, and in truth only an approximation of the actual p-value from the F distribution.
clear *
sysuse auto
gen i = ceil(_n/5)
qui sum i
gen mpg2 = mpg * 2 if i <= 5 // Get different estimation results
replace mpg2 = mpg * 10 if i > 5 // over different subsets of data
gen type = .
local N = _N // use for d.f. calculation later
local iMax = r(max) // use to iterate loop
forvalues i = 1/`iMax' {
    qui reg mpg2 mpg if i == `i', nocons
    mat b`i' = e(b) // collect returned results for Wald stat
    mat V`i' = e(V)
    sca cov`i' = V`i'[1,1]
    mat R`i' = (1)
    mat r`i' = (2) // Value you wish to test against
    if (cov`i' == 0) { // set V to be very small if Variance = 0 & calculate Wald
        mat V`i' = 1.0e-14
    }
    mat W`i' = (R`i'*b`i'-r`i')'*inv(R`i'*V`i'*R`i'')*(R`i'*b`i'-r`i')
    sca W`i' = W`i'[1,1] // collect Wald statistic into scalar
    sca p`i' = Ftail(1,`N'-2, W`i') // pull p-value from F dist
    if p`i' > .05 {
        replace type = 1 if i == `i'
    }
}
Also note that this workaround will become slightly more involved if you want to test multiple coefficients.
I hesitate to recommend these approaches without a word of caution, considering that you are in a very real sense "making up" variance estimates, but without a variance estimate you won't be able to test the coefficients at all.
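With a single restriction, the Wald statistic above reduces to (b - r)^2 / V, which makes it easy to see why a zero variance breaks the test and what substituting a tiny variance does. The Python sketch below illustrates that special case. The function name wald_statistic and the input numbers are hypothetical.

```python
def wald_statistic(b, v, r):
    """Wald statistic for a single restriction b = r with estimated variance v.

    Returns None when v is zero, mirroring the missing value Stata reports.
    """
    if v == 0:
        return None
    return (b - r) ** 2 / v


ordinary = wald_statistic(2.5, 0.01, 2.0)         # finite statistic
perfect_fit = wald_statistic(0.5, 0.0, 0.5)       # undefined: variance is zero
tiny_equal = wald_statistic(0.5, 1.0e-14, 0.5)    # tiny variance, b equals r
tiny_differs = wald_statistic(0.5, 1.0e-14, 2.0)  # tiny variance, b differs from r
print(ordinary, perfect_fit, tiny_equal, tiny_differs)
```

With the substituted variance, a coefficient that matches the hypothesized value gives a statistic of zero (p near 1), while any mismatch gives an enormous statistic (p near 0), which is the behaviour the workaround relies on.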

I use Stata to estimate the rolling standard deviation of ROA (using a window of 4 observations over the previous year). Now I would like to keep only those rolling standard deviations that are based on at least 3 non-missing ROA observations (out of 4). How can I do this in Stata?
ROA roa_sd
. .
. .
. .
.0108869 .
.0033411 .
.0032814 .0053356 (this value should be missing as it was calculated using only 2 valid values)
.0030827 .0043739
.0029793 .0038275
Your question is answered on the blog post I link to above in the comments. You can use rolling and then add an additional screen to discard sigma when the number of observations doesn't meet your threshold.
But for simple calculations like sigma and beta (i.e., standard deviation and univariate regression coefficient) you can do much better with a more manual approach. Compare the rolling solution with my manual solution.
/* generate panel by adapting the linked code */
set obs 20000
gen date = _n
gen id = floor((_n - 1) / 20) + 1
gen roa = int((100) * runiform())
replace roa = . in 1/4
replace roa = . in 10/12
replace roa = . in 18/20
/* solution with rolling */
/* */
timer on 1
xtset id date
rolling sd2 = r(sd), window(4) keep(date) saving(f2, replace): sum roa
merge 1:1 date using f2, nogenerate keepusing(sd2)
xtset id date
gen tag = missing(l3.roa) + missing(l2.roa) + missing(l1.roa) + missing(roa) > 1
gen sd = sd2 if (tag == 0)
timer off 1
/* my solution */
timer on 2
rolling_sd roa, window(4) minimum(3)
timer off 2
/* compare */
timer list
list in 1/50
The timings show that the manual solution is much faster.
. /* compare */
. timer list
1: 132.38 / 1 = 132.3830
2: 0.10 / 1 = 0.0990
Save the following as rolling_sd.ado in your personal ado file directory (or in your current working directory). I'm sure that someone could further streamline this code. Note that this code has the additional advantage of meeting the minimum data requirements at the front edge of the window (i.e., calculates sigma with first three observations, rather than waiting for all four).
*! 0.2 Richard Herron 3/30/14
* added minimum data requirement
*! 0.1 Richard Herron 1/12/12
program rolling_sd
    version 11.2
    syntax varlist(numeric), window(int) minimum(int)
    * assumption: the data are already xtset; recover the panel and time variables
    quietly xtset
    local panelvar "`r(panelvar)'"
    local timevar "`r(timevar)'"
    * declare temporary variables
    tempvar n miss xs x2s nonmiss1 nonmiss2 sigma1 sigma2
    local w = `window'
    local m = `minimum'
    * generate cumulative sums and missing values
    bysort `panelvar' (`timevar'): generate `n' = _n
    by `panelvar': generate `miss' = sum(missing(`varlist'))
    by `panelvar': generate `xs' = sum(`varlist')
    by `panelvar': generate `x2s' = sum(`varlist' * `varlist')
    * generate variance 1 (front of window)
    generate `nonmiss1' = `n' - `miss'
    generate `sigma1' = sqrt((`x2s' - `xs'*`xs'/`nonmiss1')/(`nonmiss1' - 1)) if inrange(`nonmiss1', `m', `w') & !missing(`nonmiss1')
    * generate variance 2 (back of window, main part)
    generate `nonmiss2' = `w' - s`w'.`miss'
    generate `sigma2' = sqrt((s`w'.`x2s' - s`w'.`xs'*s`w'.`xs'/`nonmiss2')/(`nonmiss2' - 1)) if inrange(`nonmiss2', `m', `w') & !missing(`nonmiss2')
    * return standard deviation
    egen sigma = rowfirst(`sigma2' `sigma1')
end
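The idea behind the program above is that running sums of the values, their squares and the non-missing count let each window's standard deviation be computed from differences of cumulative totals. The Python sketch below shows that idea for a single series. It is not a translation of the ado file: the function name rolling_sd, the use of None for missing values and the toy data are assumptions.

```python
import math


def rolling_sd(values, window, minimum):
    """Rolling sample SD over the last `window` entries, using running sums.

    None marks a missing value. A result is produced only when the window
    holds at least `minimum` non-missing values (and at least 2).
    """
    n_cum, s_cum, s2_cum = [0], [0.0], [0.0]   # prefix counts, sums, sums of squares
    for v in values:
        ok = v is not None
        n_cum.append(n_cum[-1] + (1 if ok else 0))
        s_cum.append(s_cum[-1] + (v if ok else 0.0))
        s2_cum.append(s2_cum[-1] + (v * v if ok else 0.0))
    out = []
    for end in range(1, len(values) + 1):
        start = max(0, end - window)
        n = n_cum[end] - n_cum[start]
        s = s_cum[end] - s_cum[start]
        s2 = s2_cum[end] - s2_cum[start]
        if n >= max(minimum, 2):
            var = (s2 - s * s / n) / (n - 1)
            out.append(math.sqrt(max(var, 0.0)))  # guard tiny negative rounding
        else:
            out.append(None)
    return out


roa = [None, None, 1.0, 2.0, 3.0, 4.0, None, 6.0]
sd = rolling_sd(roa, window=4, minimum=3)
print(sd)
```

Windows with fewer than `minimum` valid values yield None, which corresponds to the missing roa_sd values requested in the question.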