I am developing a package in R that I would like to convert to Rcpp for better performance. I'm new to Rcpp (and C++ in general). My problem is that the Rcpp function I've written works fine if I run it many times with one set of arguments, but if I try to loop it over many combinations of arguments, it springs memory leaks and causes the R session to abort.
Here is the code in R, which holds up well to any test I throw at it:
raw_noise <- function(timesteps, mu, sigma, phi) {
  delta <- mu * (1 - phi)
  variance <- sigma^2 * (1 - phi^2)
  noise <- vector(mode = "double", length = timesteps)
  noise[1] <- c(rnorm(1, mu, sigma))
  for (i in (1:(timesteps - 1))) {
    noise[i + 1] <- delta + phi * noise[i] + rnorm(1, 0, sqrt(variance))
  }
  return(noise)
}
Here is the code in Rcpp, using three Rcpp sugar functions (pow, sqrt, rnorm):
NumericVector raw_noise(int timesteps, double mu, double sigma, double phi) {
  double delta = mu * (1 - phi);
  double variance = pow(sigma, 2.0) * (1 - pow(phi, 2.0));
  NumericVector noise(timesteps);
  noise[0] = R::rnorm(mu, sigma);
  for(int i = 0; i < timesteps; ++i) {
    noise[i+1] = delta + phi*noise[i] + R::rnorm(0, sqrt(variance));
  }
  return noise;
}
What really confuses me is that this code runs without problems:
library(purrr)
rerun(10000, raw_noise(timesteps = 30, mu = 0.5, sigma = 0.2, phi = 0.3))
But when I run this code:
test_loop <- function(timesteps, mu, sigma, phi, replicates) {
  params <- cross_df(list(timesteps = timesteps, phi = phi, mu = mu, sigma = sigma))
  for (i in 1:nrow(params)) {
    print(params[i,])
    pmap(params[i,], raw_noise)
  }
}
library(purrr)
test_loop(timesteps=c(5, 6, 7, 8, 9, 10), mu=c(0.2, 0.5), sigma=c(0.2, 0.5),
phi=c(0, 0.1))
More often than not, the R session aborts and RStudio crashes altogether. But sometimes I manage to catch this error message before the R session aborts:
Error in match(x, table, nomatch = 0L) : GC encountered a node
(0x10db7af50) with an unknown SEXP type: NEWSXP at memory.c:1692
As I understand it, NEWSXP is an exotic object type in R that doesn't come up very often. What's happening looks to me like a memory leak, but I'm not at all sure how to fix it. Like I said, I'm new to Rcpp and C++ generally so I'd appreciate any nudges in the right direction.
You have an out of bounds error:
for(int i = 0; i < timesteps; ++i)
causes
noise[i+1]
to exceed the defined range since C++ indices start at 0 and not 1.
For example, 0 to timesteps - 1 has a length of timesteps and, thus, is okay.
but
0 to timesteps would have a length of timesteps + 1
This can be seen if you change noise[i+1] to noise(i+1), which performs a bounds check on the requested index.
Error in raw_noise(100, 2, 3, 0.2) :
Index out of bounds: [index=100; extent=100].
To address this, make the following change:
NumericVector raw_noise(int timesteps, double mu, double sigma, double phi) {
  double delta = mu * (1 - phi);
  double variance = pow(sigma, 2.0) * (1 - pow(phi, 2.0));
  NumericVector noise(timesteps);
  noise[0] = R::rnorm(mu, sigma);
  // change here
  for(int i = 0; i < timesteps - 1; ++i) { // 1 less time step
    noise[i+1] = delta + phi*noise[i] + R::rnorm(0, sqrt(variance));
  }
  return noise;
}
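For completeness, here is a minimal self-contained version of the corrected function as it might appear in a file passed to Rcpp::sourceCpp(); the embedded R chunk at the end is just one way to smoke-test the fix and is not part of the original answer.

// corrected raw_noise in a standalone sourceCpp() file (sketch, not the original post)
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector raw_noise(int timesteps, double mu, double sigma, double phi) {
  double delta = mu * (1 - phi);
  double variance = pow(sigma, 2.0) * (1 - pow(phi, 2.0));
  NumericVector noise(timesteps);
  noise[0] = R::rnorm(mu, sigma);
  for (int i = 0; i < timesteps - 1; ++i) {  // last write is noise[timesteps - 1]
    noise[i + 1] = delta + phi * noise[i] + R::rnorm(0, sqrt(variance));
  }
  return noise;
}

/*** R
# quick check: no out-of-bounds writes for a range of timesteps
for (ts in 5:10) str(raw_noise(ts, mu = 0.5, sigma = 0.2, phi = 0.3))
*/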
Related
Well, I had a task to create a function that computes the Fourier series of some mathematical function, so I found all the formulas, but the main problem is that when I change the number of points used to draw the series on some interval, I get a very strange artifact:
This is the Fourier series of sin(x) on the interval (-3.14; 3.14) with 100 points for tabulation.
And this is the same function on the same interval but with 100,000 points for tabulation.
Code for the Fourier series coefficients:
void fourieSeriesDecompose(std::function<double(double)> func, double period, long int iterations, double *&aParams, double *&bParams){
    aParams = new double[iterations];
    aParams[0] = integrateRiemans(func, 0, period, 1000);
    for(int i = 1; i < iterations; i++){
        auto sineFunc = [&](double x) -> double { return 2 * (func(x) * cos((2 * x * i * M_PI) / period)); };
        aParams[i] = integrateRiemans(sineFunc, -period / 2, period / 2, 1000) / period;
    }
    bParams = new double[iterations];
    for(int i = 1; i < iterations; i++){
        auto sineFunc = [&](double x) -> double { return 2 * (func(x) * sin(2 * (x * (i + 1) * M_PI) / period)); };
        bParams[i] = integrateRiemans(sineFunc, -period / 2, period / 2, 1000) / period;
    }
}
This is the code I use to reconstruct the function from the computed coefficients:
double fourieSeriesCompose(double x, double period, long iterations, double *aParams, double *bParams){
    double y = aParams[0];
    for(int i = 1; i < iterations; i++){
        y += sqrt(aParams[i] * aParams[i] + bParams[i] * bParams[i]) * cos((2 * i * x * M_PI) / period - atan(bParams[i] / aParams[i]));
    }
    return y;
}
And the runner code
double period = M_PI * 2;
auto startFunc = [](double x) -> double{ return sin(x); };
fourieSeriesDecompose(*startFunc, period, 1000, aCoeficients, bCoeficients);
auto readyFunc = [&](double x) -> double{ return fourieSeriesCompose(x, period, 1000, aCoeficients, bCoeficients); };
tabulateFunc(readyFunc);
scaleFunc();
//Draw methods after this
see:
How to compute Discrete Fourier Transform?
If I deciphered it correctly, aParams and bParams represent the real and imaginary parts of the result, so the angles inside the sin and cos must be the same, but yours are different. You have this:
auto sineFunc = [&](double x) -> double { return 2*(func(x)*cos((2* x* i *M_PI)/period)); };
auto sineFunc = [&](double x) -> double { return 2*(func(x)*sin( 2*(x*(i+1)*M_PI)/period)); };
As you can see, it is not the same angle. Also, what is period? You have iterations! If it is the period of the function you want to transform, then it should be applied to the function and not to the kernel... Also, what does integrateRiemans do? Is it the nested for loop that integrates the Fourier transform? By the way, I hope func is real-valued; otherwise the integration/summation needs both the real and imaginary parts, not just one...
So what you should do is:
create a (complex) table of the func(x) data on the interval you want, with iterations samples,
so a for loop where x = x0+i*(x1-x0)/(iterations-1) and x0, x1 is the range over which you want to sample func. Let's call it f[i]:
for (i=0;i<iterations;i++) f[i]=func(x0+i*(x1-x0)/(iterations-1));
Fourier transform it,
something like this:
for (i=0;i<iterations;i++) a[i]=b[i]=0;
for (j=0;j<iterations;j++)
 for (i=0;i<iterations;i++)
    {
    a[j]+=f[i]*cos(-2.0*M_PI*i*j/iterations);
    b[j]+=f[i]*sin(-2.0*M_PI*i*j/iterations);
    }
Now a[], b[] should hold your slow DFT result... Beware of integer rounding: depending on the compiler, you might need to cast some of the terms to double to avoid it.
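A self-contained sketch of the two steps above (tabulation, then the slow DFT), assuming a real-valued func sampled on [x0, x1]; the names slow_dft, f, a and b are illustrative and not from the original code:

#include <cmath>
#include <functional>
#include <vector>

// sketch: tabulate func on [x0, x1] and compute a slow O(N^2) DFT of the samples
void slow_dft(const std::function<double(double)> &func,
              double x0, double x1, int iterations,
              std::vector<double> &a, std::vector<double> &b)
{
    // 1) tabulate: f[i] = func(x0 + i*(x1-x0)/(iterations-1))
    std::vector<double> f(iterations);
    for (int i = 0; i < iterations; i++)
        f[i] = func(x0 + i * (x1 - x0) / (iterations - 1));

    // 2) slow DFT: a[j] and b[j] hold the real and imaginary parts of bin j
    a.assign(iterations, 0.0);
    b.assign(iterations, 0.0);
    for (int j = 0; j < iterations; j++)
        for (int i = 0; i < iterations; i++)
        {
            // cast to double before dividing to avoid integer rounding
            double angle = -2.0 * M_PI * double(i) * double(j) / double(iterations);
            a[j] += f[i] * std::cos(angle);
            b[j] += f[i] * std::sin(angle);
        }
}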
I edited the lasso code from this site to use it for multiple lambda values.
I used the lassoshooting package for a single lambda value (that package only handles one lambda value at a time) and glmnet for multiple lambda values, for comparison.
The coefficient estimates are different, and this is expected because of standardization and scaling back to the original scale. This is out of scope and not important here.
For the one-parameter case, lassoshooting is 1.5 times faster.
Both methods used all 100 lambda values in my code for the multiple-lambda case, but glmnet is 7.5 times faster than my C++ code. Of course I expected glmnet to be faster, but this much seems excessive. Is it normal, or is my code wrong?
EDIT
I have also attached the lshoot function, which computes the coefficient path in an R loop. This outperforms my C++ code too.
Can I improve my C++ code?
C++ code:
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;
using namespace arma;
// [[Rcpp::export]]
vec softmax_cpp(const vec & x, const vec & y) {
  return sign(x) % max(abs(x) - y, zeros(x.n_elem));
}

// [[Rcpp::export]]
mat lasso(const mat & X, const vec & y, const vec & lambda,
          const double tol = 1e-7, const int max_iter = 10000){
  int p = X.n_cols; int lam = lambda.n_elem;
  mat XX = X.t() * X;
  vec Xy = X.t() * y;
  vec Xy2 = 2 * Xy;
  mat XX2 = 2 * XX;
  mat betas = zeros(p, lam); // to store the betas
  vec beta = zeros(p);       // initial beta for each lambda
  bool converged = false;
  int iteration = 0;
  vec beta_prev, aj, cj;
  for(int l = 0; l < lam; l++){
    while (!converged && (iteration < max_iter)){
      beta_prev = beta;
      for (int j = 0; j < p; j++){
        aj = XX2(j,j);
        cj = Xy2(j) - dot(XX2.row(j), beta) + beta(j) * XX2(j,j);
        beta(j) = as_scalar(softmax_cpp(cj / aj, as_scalar(lambda(l)) / aj));
      }
      iteration = iteration + 1;
      converged = norm(beta_prev - beta, 1) < tol;
    }
    betas.col(l) = beta;
    iteration = 0;
    converged = false;
  }
  return betas;
}
R code:
library(Rcpp)
library(rbenchmark)
library(glmnet)
library(lassoshooting)
sourceCpp("LASSO.cpp")
library(ElemStatLearn)
X <- as.matrix(prostate[,-c(9,10)])
y <- as.matrix(prostate[,9])
lambda_one <- 0.1
benchmark(cpp=lasso(X,y,lambda_one),
lassoshooting=lassoshooting(X,y,lambda_one)$coefficients,
order="relative", replications=100)[,1:4]
################################################
lambda <- seq(0,10,len=100)
benchmark(cpp=lasso(X,y,lambda),
glmn=coef(glmnet(X,y,lambda=lambda)),
order="relative", replications=100)[,1:4]
####################################################
EDIT
lambda <- seq(0,10,len=100)
lshoot <- function(lambda){
betas <- matrix(NA,8,100)
for(l in 1:100){
betas[, l] <- lassoshooting(X,y,lambda[l])$coefficients
}
return(betas)
}
benchmark(cpp=lasso(X,y,lambda),
lassoshooting_loop=lshoot(lambda),
order="relative", replications=300)[,1:4]
Results for one parameter case:
test replications elapsed relative
2 lassoshooting 300 0.06 1.0
1 cpp 300 0.09 1.5
Results for multiple parameter case:
test replications elapsed relative
2 glmn 300 0.70 1.000
1 cpp 300 5.24 7.486
Results for lassoshooting loop and cpp:
test replications elapsed relative
2 lassoshooting_loop 300 4.06 1.000
1 cpp 300 6.38 1.571
Package {glmnet} uses warm starts and special rules for discarding lots of predictors, which makes fitting the whole "regularization path" very fast.
See their paper.
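This is not glmnet's actual algorithm, but a minimal sketch of how a couple of related ideas could be applied to the coordinate-descent code above: run the lambda sequence from largest to smallest so each solution warm-starts the next, precompute the diagonal once, and use a scalar soft-threshold instead of the vector-valued softmax_cpp call. The names lasso_sketch and soft_threshold are illustrative, not from the question.

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace arma;

// scalar soft-thresholding: S(z, g) = sign(z) * max(|z| - g, 0)
inline double soft_threshold(double z, double g) {
  if (z >  g) return z - g;
  if (z < -g) return z + g;
  return 0.0;
}

// [[Rcpp::export]]
arma::mat lasso_sketch(const arma::mat& X, const arma::vec& y, arma::vec lambda,
                       const double tol = 1e-7, const int max_iter = 10000) {
  const int p = X.n_cols, nlam = lambda.n_elem;
  const mat XX2   = 2.0 * (X.t() * X);
  const vec Xy2   = 2.0 * (X.t() * y);
  const vec diag2 = XX2.diag();            // the a_j terms, computed once
  lambda = sort(lambda, "descend");        // large -> small: each fit warm-starts the next
  mat betas = zeros(p, nlam);              // columns follow the sorted lambda order
  vec beta  = zeros(p);                    // carried across lambda values (warm start)
  for (int l = 0; l < nlam; ++l) {
    for (int it = 0; it < max_iter; ++it) {
      double change = 0.0;                 // L1 norm of the changes made in this sweep
      for (int j = 0; j < p; ++j) {
        double cj = Xy2(j) - dot(XX2.row(j), beta) + beta(j) * diag2(j);
        double bj = soft_threshold(cj, lambda(l)) / diag2(j);
        change += std::abs(bj - beta(j));
        beta(j) = bj;
      }
      if (change < tol) break;
    }
    betas.col(l) = beta;
  }
  return betas;
}

Note that the columns of the returned matrix follow the sorted (descending) lambda order, so they need to be matched back to the original lambda values when comparing against glmnet.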
I have the following loop for a Monte Carlo computation I am performing:
The variables below are pre-computed/populated and are defined as:
w_ = std::vector<std::vector<double>>(150000, std::vector<double>(800));
C_ = Eigen::MatrixXd(800,800);
Eigen::VectorXd a(800);
Eigen::VectorXd b(800);
The while loop takes about 570 seconds to compute. Just going by the loops, I understand that I have nPaths*m = 150,000 * 800 = 120,000,000 sets of computations happening (not counting the cdf computations handled by the Boost libraries).
I am a below-average programmer and was wondering whether there are any obvious mistakes I am making that may be slowing the computation down, or whether there is another way to handle the computation that could speed things up.
int N(0);
int nPaths(150000);
int m(800);
double Varsum(0.);
double err;
double delta;
double v1, v2, v3, v4;
Eigen::VectorXd d = Eigen::VectorXd::Zero(m);
Eigen::VectorXd e = Eigen::VectorXd::Zero(m);
Eigen::VectorXd f = Eigen::VectorXd::Zero(m);
Eigen::VectorXd y;
Eigen::VectorXd y0 = Eigen::VectorXd::Zero(m);
boost::math::normal G(0, 1.);
d(0) = boost::math::cdf(G, a(0) / C_(0, 0));
e(0) = boost::math::cdf(G, b(0) / C_(0, 0));
f(0) = e(0) - d(0);
while (N < (nPaths-1))
{
    y = y0;
    for (int i = 1; i < m; i++)
    {
        v1 = d(i - 1) + w_[N][(i - 1)]*(e(i - 1) - d(i - 1));
        y(i - 1) = boost::math::quantile(G, v1);
        v2 = (a(i) - C_.row(i).dot(y)) / C_(i, i);
        v3 = (b(i) - C_.row(i).dot(y)) / C_(i, i);
        d(i) = boost::math::cdf(G, v2);
        e(i) = boost::math::cdf(G, v3);
        f(i) = (e(i) - d(i))*f(i - 1);
    }
    N++;
    delta = (f(m-1) - Intsum) / N;
    Intsum += delta;
    Varsum = (N - 2)*Varsum / N + delta*delta;
    err = alpha_*std::sqrt(Varsum);
}
If I understand your code right, the running time is actually O(nPaths*m*m) = 10^11, due to the dot product C_.row(i).dot(y), which needs O(m) operations.
You could speed up the program by a factor of two by not calculating it twice:
double prod = C_.row(i).dot(y);
v2 = (a(i) - prod) / C_(i, i);
v3 = (b(i) - prod) / C_(i, i);
but maybe the compiler already does this for you.
The other thing is that y consists of zeros (at least at the beginning), so you don't have to compute the full dot product, only up to the current value of i. That should give another factor-of-two speedup.
So, taking into account the sheer number of operations, your timings are not so bad. There is some room for improving the code, but if you are interested in speedups of several orders of magnitude, you should probably think about changing your formulation.
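A minimal sketch of both suggestions combined, assuming the same pre-populated w_, C_, a, b and G, the same surrounding while loop, and that y is reset to zeros for each path as in the original code. The dot product is computed once per i and only over the first i entries of y, since the remaining entries are still zero at that point.

for (int i = 1; i < m; i++)
{
    v1 = d(i - 1) + w_[N][i - 1] * (e(i - 1) - d(i - 1));
    y(i - 1) = boost::math::quantile(G, v1);

    // one dot product per i, restricted to the filled prefix y(0..i-1)
    double prod = C_.row(i).head(i).dot(y.head(i));
    d(i) = boost::math::cdf(G, (a(i) - prod) / C_(i, i));
    e(i) = boost::math::cdf(G, (b(i) - prod) / C_(i, i));
    f(i) = (e(i) - d(i)) * f(i - 1);
}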
I am trying to fit a model using the Levenberg-Marquardt method, following Numerical Recipes.
The problem is: it does not converge, or when it does, it is not precise... or at least the covariance matrix is strange.
int i = 0;
for (i = 0; i < 3e4; i++) {
    mrqmin(x, y, sig, NPCalib, a, ia, 3, covar, alpha, &chisk, afunc, &alamda);
    if (chisk < 1e-8)
        sumchisk++;
    if (sumchisk > 5)
        break;
    if (alamda > 1e8)
        alamda = 1e8;
}
(x, y) are 3 points (double) that are fit pretty well by the form y = a(x - x0)^2.
Using sumchisk like this is what Numerical Recipes recommends for this function.
alamda is capped here because otherwise it might overflow.
Other definitions and data points:
double a[4] = {0.0, 0.0001, 100.0, -1};
int ia[4] = {0, 1, 1, 0};
double x[] = {0.0, 799.157549545577, 799.92196995454, 800.683769692575};
double y[] = {0.0, 524.26491, 525.26768, 526.26586};
double sig[] = {0.0, 0.1*y[1], 0.1*y[2], 0.1*y[3]};
double **covar = new double*[4];
covar[1] = new double[4];
covar[2] = new double[4];
covar[3] = new double[4];
double **alpha = new double*[4];
alpha[1] = new double[4];
alpha[2] = new double[4];
alpha[3] = new double[4];
double chisk = 0;
double alamda = -1;
void afunc(int i, double x[], double a[], double *y, double dyda[], int ma)
{
    *y = a[1] * pow(x[i] + a[2], 2) / pow(1 + a[3] * CT[i - 1], 2);
    dyda[1] = pow(x[i] + a[2], 2) / pow(1 + a[3] * CT[i - 1], 2);
    dyda[2] = (2 * a[1] * (x[i] + a[2])) / pow(1 + a[3] * CT[i - 1], 2);
    dyda[3] = (-2 * a[1] * CT[i - 1] * pow(x[i] + a[2], 2)) / pow(1 + a[3] * CT[i - 1], 3);
}
I changed the NR source code to use double instead of float. The first array element is not used because this comes from Fortran code and I didn't feel like changing such a small detail.
The model also contains a third parameter, which isn't used in this fit and thus remains a[3] = -1, because ia[3] = 0. ia[] = 1 means the parameter is to be fitted...
However, now I have the problem that sometimes this doesn't converge. It finishes with alamda = 1e8 and i = 3e4, especially when I set the threshold for chisk lower.
The sets of parameters seem to be fine, though... chisk is, for example, about 1e-6 and the parameters look fine, but looking at the diagonal of the covariance matrix (which should give the squared standard deviation of each parameter), there is some rubbish like ~800000 for a parameter of 0.0001.
Does anyone know what I did wrong when using this algorithm?
Anything specific I need to write into covar/alpha when I start? Can the sig be set like this?
My question is not how to filter an image using the laplacian of gaussian (basically using filter2D with the relevant kernel etc.).
What I want to know is how I generate the NxN kernel.
I'll give an example showing how I generated a [WinSize x WinSize] Gaussian kernel in OpenCV.
In Matlab:
gaussianKernel = fspecial('gaussian', WinSize, sigma);
In openCV:
cv::Mat gaussianKernel = cv::getGaussianKernel(WinSize, sigma, CV_64F);
cv::mulTransposed(gaussianKernel,gaussianKernel,false);
Where sigma and WinSize are predefined.
I want to do the same for a Laplacian of Gaussian.
In Matlab:
LoGKernel = fspecial('log', WinSize, sigma);
How do I get the exact kernel in openCV (exact up to negligible numerical differences)?
I'm working on a specific application where I need the actual kernel values, and simply finding another way of implementing LoG filtering by approximating a Difference of Gaussians is not what I'm after.
Thanks!
You can generate it manually, using the formula
LoG(x,y) = -(1/(pi*sigma^4)) * (1 - (x^2+y^2)/(2*sigma^2)) * e^(-(x^2+y^2)/(2*sigma^2))
http://homepages.inf.ed.ac.uk/rbf/HIPR2/log.htm
cv::Mat kernel(WinSize,WinSize,CV_64F);
int rows = kernel.rows;
int cols = kernel.cols;
double halfSize = (double)(WinSize - 1) / 2.0;   // center of the kernel
for (int i = 0; i < rows; i++)
    for (int j = 0; j < cols; j++)
    {
        double x = (double)j - halfSize;
        double y = (double)i - halfSize;
        double r2 = x*x + y*y;
        // note the 2*sigma^2 in both the bracket and the exponent, and the leading minus sign
        kernel.at<double>(i,j) = -(1.0 / (M_PI*pow(sigma,4)))
                                 * (1.0 - r2/(2*sigma*sigma))
                                 * exp(-r2/(2*sigma*sigma));
    }
If the function above is not OK, you can simply rewrite the Matlab version of fspecial:
case 'log' % Laplacian of Gaussian
% first calculate Gaussian
siz = (p2-1)/2;
std2 = p3^2;
[x,y] = meshgrid(-siz(2):siz(2),-siz(1):siz(1));
arg = -(x.*x + y.*y)/(2*std2);
h = exp(arg);
h(h<eps*max(h(:))) = 0;
sumh = sum(h(:));
if sumh ~= 0,
h = h/sumh;
end;
% now calculate Laplacian
h1 = h.*(x.*x + y.*y - 2*std2)/(std2^2);
h = h1 - sum(h1(:))/prod(p2); % make the filter sum to zero
I want to thank old-ufo for nudging me in the correct direction.
I was hoping I wouldn't have to reinvent the wheel, but I guess a quick Matlab --> OpenCV conversion is the best solution I have for now.
NOTE - I did this for square kernels only (easy to modify otherwise, but I have no need for that, so...).
Maybe this can be written in a more elegant form, but it was a quick job so I could carry on with more pressing matters.
From main function:
int WinSize(7); double sigma(1.0); // can be changed to other odd-sized WinSize and different sigma values
cv::Mat h = fspecialLoG(WinSize,sigma);
And the actual function is:
// return NxN (square kernel) of Laplacian of Gaussian as is returned by Matlab's: fspecial('log', WinSize, sigma)
cv::Mat fspecialLoG(int WinSize, double sigma){
    // I wrote this only for square kernels as I have no need for kernels that aren't square
    cv::Mat xx (WinSize,WinSize,CV_64F);
    for (int i=0;i<WinSize;i++){
        for (int j=0;j<WinSize;j++){
            xx.at<double>(j,i) = (i-(WinSize-1)/2)*(i-(WinSize-1)/2);
        }
    }
    cv::Mat yy;
    cv::transpose(xx,yy);
    cv::Mat arg = -(xx+yy)/(2*pow(sigma,2));
    cv::Mat h (WinSize,WinSize,CV_64F);
    for (int i=0;i<WinSize;i++){
        for (int j=0;j<WinSize;j++){
            h.at<double>(j,i) = pow(exp(1),(arg.at<double>(j,i)));
        }
    }
    double minimalVal, maximalVal;
    minMaxLoc(h, &minimalVal, &maximalVal);
    cv::Mat tempMask = (h>DBL_EPSILON*maximalVal)/255;
    tempMask.convertTo(tempMask,h.type());
    cv::multiply(tempMask,h,h);
    if (cv::sum(h)[0]!=0){h=h/cv::sum(h)[0];}

    cv::Mat h1 = (xx + yy - 2*pow(sigma,2)) / pow(sigma,4);
    cv::multiply(h,h1,h1);
    h = h1 - cv::sum(h1)[0]/(WinSize*WinSize);
    return h;
}
There is some difference between your function and the matlab version:
http://br1.einfach.org/tmp/log-matlab-vs-opencv.png.
Above is matlab fspecial('log', 31, 6) and below is the result of your function with the same parameters. Somehow the hat is more 'bent' - is this intended and what is the effect of this in later processing?
I can create a kernel very similar to the matlab one with these functions, which just directly reflect the LoG formula:
float LoG(int x, int y, float sigma) {
    float xy = (pow(x, 2) + pow(y, 2)) / (2 * pow(sigma, 2));
    return -1.0 / (M_PI * pow(sigma, 4)) * (1.0 - xy) * exp(-xy);
}

static Mat LOGkernel(int size, float sigma) {
    Mat kernel(size, size, CV_32F);
    int halfsize = size / 2;
    for (int x = -halfsize; x <= halfsize; ++x) {
        for (int y = -halfsize; y <= halfsize; ++y) {
            kernel.at<float>(x+halfsize, y+halfsize) = LoG(x, y, sigma);
        }
    }
    return kernel;
}
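For what it's worth, here is a possible usage sketch for the kernel above (the variable names are assumptions, not from the original answer): build the kernel once and convolve with cv::filter2D, keeping a signed floating-point destination so the negative LoG responses are not clipped.

cv::Mat src_gray;                        // assumed: a single-channel input image, e.g. loaded with cv::IMREAD_GRAYSCALE
cv::Mat logKernel = LOGkernel(31, 6.0f); // e.g. a 31x31 kernel with sigma = 6
cv::Mat response;
cv::filter2D(src_gray, response, CV_32F, logKernel);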
Here's a NumPy version that is directly translated from the fspecial function in MATLAB.
import numpy as np
import sys
def get_log_kernel(siz, std):
    x = y = np.linspace(-siz, siz, 2*siz+1)
    x, y = np.meshgrid(x, y)
    arg = -(x**2 + y**2) / (2*std**2)
    h = np.exp(arg)
    h[h < sys.float_info.epsilon * h.max()] = 0
    h = h/h.sum() if h.sum() != 0 else h
    h1 = h*(x**2 + y**2 - 2*std**2) / (std**4)
    return h1 - h1.mean()
The code below is the exact equivalent to fspecial('log', p2, p3):
def fspecial_log(p2, std):
    siz = int((p2-1)/2)
    x = y = np.linspace(-siz, siz, 2*siz+1)
    x, y = np.meshgrid(x, y)
    arg = -(x**2 + y**2) / (2*std**2)
    h = np.exp(arg)
    h[h < sys.float_info.epsilon * h.max()] = 0
    h = h/h.sum() if h.sum() != 0 else h
    h1 = h*(x**2 + y**2 - 2*std**2) / (std**4)
    return h1 - h1.mean()
I wrote an exact implementation of Matlab's fspecial function in OpenCV.
Function:
Mat C_fspecial_LOG(double* kernel_size, double sigma)
{
    double size[2] = { (kernel_size[0]-1)/2 , (kernel_size[1]-1)/2 };
    double std = sigma;
    const double eps = 2.2204e-16;
    cv::Mat kernel(kernel_size[0], kernel_size[1], CV_64FC1, 0.0);
    int row = 0, col = 0;
    for (double y = -size[0]; y <= size[0]; ++y, ++row)
    {
        col = 0;
        for (double x = -size[1]; x <= size[1]; ++x, ++col)
        {
            kernel.at<double>(row,col) = exp( -( pow(x,2) + pow(y,2) ) / (2*pow(std,2)) );
        }
    }
    double MaxValue;
    cv::minMaxLoc(kernel, nullptr, &MaxValue, nullptr, nullptr);
    Mat condition = ~(kernel < eps*MaxValue)/255;
    condition.convertTo(condition, CV_64FC1);
    kernel = kernel.mul(condition);
    cv::Scalar SUM = cv::sum(kernel);
    if (SUM[0] != 0)
    {
        kernel /= SUM[0];
    }
    return kernel;
}
Usage of this function:
double kernel_size[2] = {4,4}; // kernel size set to 4x4
double sigma = 2.1;
Mat kernel = C_fspecial_LOG(kernel_size,sigma);
Compare the OpenCV result with Matlab:
OpenCV result:
[0.04918466596701741, 0.06170341496034986, 0.06170341496034986, 0.04918466596701741;
0.06170341496034986, 0.07740850411228289, 0.07740850411228289, 0.06170341496034986;
0.06170341496034986, 0.07740850411228289, 0.07740850411228289, 0.06170341496034986;
0.04918466596701741, 0.06170341496034986, 0.06170341496034986, 0.04918466596701741]
Matlab result for fspecial('gaussian', 4, 2.1) :
0.0492 0.0617 0.0617 0.0492
0.0617 0.0774 0.0774 0.0617
0.0617 0.0774 0.0774 0.0617
0.0492 0.0617 0.0617 0.0492
Just for the sake of reference, here is a Python implementation which creates the LoG filter kernel to detect blobs of a pre-defined radius in pixels.
import numpy as np

def create_log_filter_kernel(r_in_px: float):
    """
    Creates a LoG filter-kernel to detect blobs of a given radius r_in_px.
    \[
    LoG(x,y) = \frac{-1}{\pi\sigma^4}\left(1 - \frac{x^2 + y^2}{2\sigma^2}\right)e^{\frac{-(x^2+y^2)}{2\sigma^2}}
    \]
    Look for maxima if blob is black, minima if blob is white.
    :param r_in_px:
    :return: filter kernel
    """
    # sigma from radius: LoG has zero-crossing at $1 - \frac{x^2 + y^2}{2\sigma^2} = 0$
    # i.e. $r^2 = 2\sigma^2$ and thus $\sigma = r / \sqrt{2}$
    sigma = r_in_px/np.sqrt(2)
    # ksize such that filter covers $3\sigma$
    ksize = int(np.round(sigma*3))*2 + 1
    # setup filter
    xgv = np.arange(0, ksize) - ksize / 2
    ygv = np.arange(0, ksize) - ksize / 2
    x, y = np.meshgrid(xgv, ygv)
    kernel = -1 / (np.pi * sigma**4) * (1 - (x**2 + y**2) / (2*sigma**2)) * np.exp(-(x**2 + y**2) / (2 * sigma**2))
    # normalize to sum zero (does not change zero crossing, I tried it out for r < 100)
    kernel -= np.sum(kernel) / ksize**2
    # this is important: normalize such that positive/negative parts are comparable over different scales
    kernel /= np.sum(kernel[kernel > 0])
    return kernel