Outlier detection in small sets - c++

Is there a good algorithm for detecting outliers in small sets of decimal numbers? The best idea I have come up with so far is a kind of recursive standard deviation based approach, but it seems a bit computationally expensive.
I'm using c++, so any existing functionality in say Boost or other maths helper libraries is welcome in your answers.
Thanks.

You can do it in O(n) time with an online variance algorithm (http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm) and then a second pass to mark outliers.

Related

Best online solution for solving linear programming problems

What is the best online solution to solve a linear programming problem?
I heard about several like Gurobi.
One thing I especially want is the possibility to get an approximate solution when the exact resolution takes too long.
The most comprehensive online optimization system is NEOS. It takes models in a variety of input formats and has a wide range of solvers.
Many solvers have settings to allow them to terminate early, even before optimality is reached, if you want an approximate and quick solution. But often your best bet in that case is to use a heuristic algorithm designed specifically for your problem.

What is a fast simple solver for a large Laplacian matrix?

I need to solve some large (N~1e6) Laplacian matrices that arise in the study of resistor networks. The rest of the network analysis is being handled with boost graph and I would like to stay in C++ if possible. I know there are lots and lots of C++ matrix libraries but no one seems to be a clear leader in speed or usability. Also, the many questions on the subject, here and elsewhere seem to rapidly devolve into laundry lists which are of limited utility. In an attempt to help myself and others, I will try to keep the question concise and answerable:
What is the best library that can effectively handle the following requirements?
Matrix type: Symmetric Diagonal Dominant/Laplacian
Size: Very large (N~1e6), no dynamic resizing needed
Sparsity: Extreme (maximum 5 nonzero terms per row/column)
Operations needed: Solve for x in A*x=b and mat/vec multiply
Language: C++ (C ok)
Priority: Speed and simplicity to code. I would really rather avoid having to learn a whole new framework for this one problem or have to manually write too much helper code.
Extra love to answers with a minimal working example...
If you want to write your own solver, in terms of simplicity, it's hard to beat Gauss-Seidel iteration. The update step is one line, and it can be parallelized easily. Successive over-relaxation (SOR) is only slightly more complicated and converges much faster.
Conjugate gradient is also straightforward to code, and should converge much faster than the other iterative methods. The important thing to note is that you don't need to form the full matrix A, just compute matrix-vector products A*b. Once that's working, you can improve the convergance rate again by adding a preconditioner like SSOR (Symmetric SOR).
Probably the fastest solution method that's reasonable to write yourself is a Fourier-based solver. It essentially involves taking an FFT of the right-hand side, multiplying each value by a function of its coordinate, and taking the inverse FFT. You can use an FFT library like FFTW, or roll your own.
A good reference for all of these is A First Course in the Numerical Analysis of Differential Equations by Arieh Iserles.
Eigen is quite nice to use and one of the fastest libraries I know:
http://eigen.tuxfamily.org/dox/group__TutorialSparse.html
There is a lot of related post, you could have look.
I would recommend C++ and Boost::ublas as used in UMFPACK and BOOST's uBLAS Sparse Matrix

Is there an Integer Linear Programming software that returns also non-optimal solutions?

I have an integer linear optimisation problem and I'm interested in feasible, good solutions. As far as I know, for example the Gnu Linear Programming Kit only returns the optimal solution (given it exists).
This takes endless time and is not exactly what I'm looking for: I would be happy with any good solution, not only the optimal one.
So a LP-Solver that e.g. stops after some time and returns the best solution he found so far, would do the job.
Is there any such software? It would be great if that software was open source or at least free as in beer.
Alternatively: Is there any other way that usually speeds up Integer LP problems?
Is this the right place to ask?
Many solvers provide a time limit parameter; if you set the time limit parameter, they will stop once the time limit is reached. If an integer feasible solution has been found, it will return the best feasible solution found to that point.
As you may know, integer programming is NP-hard, and there is a real art to finding optimal solutions as well as good feasible solutions quickly. To compare the different solvers, see Prof. Hans Mittelmann's Benchmarks for Optimization Software. The MILP benchmarks - particularly MIPLIB2010 and the Feasibility Benchmark should be most relevant.
In addition to selecting a good solver, there are many things that can be done to improve solve times including tuning the parameters of the solver and model reformulation. Many people in research and industry - including myself - spend our careers working on improving the solve times of MIP models, both in general and for specific models.
If you are an academic user, note that the top commercial systems like CPLEX and Gurobi are free for academic use. See the respective websites for details.
Finally, you may want to look at OR-Exchange, a sister site to Stack Overflow that focuses on the field of operations research.
(Disclaimer: I currently work for Gurobi Optimization and formerly worked for ILOG, which provided CPLEX).
If you would like to get a feasibel integer solution fast and if you don't need the optimal solution, you can try
Increase the relative or absolute Gap. Usually solvers have small gaps of say 0.0001% for relative gap. This means that the solver will continue searching for MIP solutions until it the MIP solution is not farther than 0.0001% away from the optimal solution. Increase this gab to e.g. 1%., So you get good solution, but the solver will not spent a long time in proving optimality.
Try different values for solver parameters concerning MIP heuristics.
CPLEX and GUROBI have parameters to control, MIP emphasis. This means that the solver will put more emphasis on looking for feasible solutions or on proving optimality. Set emphasis to feasible MIP solutions.
Most solvers like CPLEX, Gurobi, MOPS or GLPK have settings for gap and heuristics. MIP emphasis can be set - as far as I know - only in CPLEX and Gurobi.
A usual approach for solving ILP is branch-and-bound. This utilized the solution of many sub-LP (without-I). The finally optimal result is the best of all sub-LP. As at least one solution is found you could stop anytime and would have a best-so-far.
One package that could do it, is the free lpsolve. Look there at set_timeout for giving a time limit, and when it is ILP the solve function can return in SUPOPTIMAL the best_so_far value.
As far as I know CPLEX can. It can return the solution pool which contains primal feasible solutions in the search, and if you specify the search focus on feasibility rather on optimality, more faesible solutions can be generated. At the end you can just export the pool. You can use the pool to do a hot start so it's pretty up to you. CPlex is free now at least in my country as you can sign up as a researcher.
Could you take into account Microsoft Solver Foundation? The only restriction is technology stack that you prefer and here you should use, as you guess, Microsoft technologies: C#, vb.net, etc. Here is example how to use it with Excel: http://channel9.msdn.com/posts/Modeling-with-Solver-Foundation-30 .
Regarding to your question it is possible to have not a fully optimized solutions if you set efficiency (for example 85% or 0.85). In outcome you can see all possible solutions for such restriction.

c++ numerical analysis Accurate data structure?

Using double type I made Cubic Spline Interpolation Algorithm.
That work was success as it seems, but there was a relative error around 6% when very small values calculated.
Is double data type enough for accurate scientific numerical analysis?
Double has plenty of precision for most applications. Of course it is finite, but it's always possible to squander any amount of precision by using a bad algorithm. In fact, that should be your first suspect. Look hard at your code and see if you're doing something that lets rounding errors accumulate quicker than necessary, or risky things like subtracting values that are very close to each other.
Scientific numerical analysis is difficult to get right which is why I leave it the professionals. Have you considered using a numeric library instead of writing your own? Eigen is my current favorite here: http://eigen.tuxfamily.org/index.php?title=Main_Page
I always have close at hand the latest copy of Numerical Recipes (nr.com) which does have an excellent chapter on interpolation. NR has a restrictive license but the writers know what they are doing and provide a succinct writeup on each numerical technique. Other libraries to look at include: ATLAS and GNU Scientific Library.
To answer your question double should be more than enough for most scientific applications, I agree with the previous posters it should like an algorithm problem. Have you considered posting the code for the algorithm you are using?
If double is enough for your needs depends on the type of numbers you are working with. As Henning suggests, it is probably best to take a look at the algorithms you are using and make sure they are numerically stable.
For starters, here's a good algorithm for addition: Kahan summation algorithm.
Double precision will be mostly suitable for any problem but the cubic spline will not work well if the polynomial or function is quickly oscillating or repeating or of quite high dimension.
In this case it can be better to use Legendre Polynomials since they handle variants of exponentials.
By way of a simple example if you use, Euler, Trapezoidal or Simpson's rule for interpolating within a 3rd order polynomial you won't need a huge sample rate to get the interpolant (area under the curve). However, if you apply these to an exponential function the sample rate may need to greatly increase to avoid loosing a lot of precision. Legendre Polynomials can cater for this case much more readily.

Least Squares Regression in C/C++

How would one go about implementing least squares regression for factor analysis in C/C++?
the gold standard for this is LAPACK. you want, in particular, xGELS.
When I've had to deal with large datasets and large parameter sets for non-linear parameter fitting I used a combination of RANSAC and Levenberg-Marquardt. I'm talking thousands of parameters with tens of thousands of data-points.
RANSAC is a robust algorithm for minimizing noise due to outliers by using a reduced data set. Its not strictly Least Squares, but can be applied to many fitting methods.
Levenberg-Marquardt is an efficient way to solve non-linear least-squares numerically.
The convergence rate in most cases is between that of steepest-descent and Newton's method, without requiring the calculation of second derivatives. I've found it to be faster than Conjugate gradient in the cases I've examined.
The way I did this was to set up the RANSAC an outer loop around the LM method. This is very robust but slow. If you don't need the additional robustness you can just use LM.
Get ROOT and use TGraph::Fit() (or TGraphErrors::Fit())?
Big, heavy piece of software to install just of for the fitter, though. Works for me because I already have it installed.
Or use GSL.
If you want to implement an optimization algorithm by yourself Levenberg-Marquard seems to be quite difficult to implement. If really fast convergence is not needed, take a look at the Nelder-Mead simplex optimization algorithm. It can be implemented from scratch in at few hours.
http://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method
Have a look at
http://www.alglib.net/optimization/
They have C++ implementations for L-BFGS and Levenberg-Marquardt.
You only need to work out the first derivative of your objective function to use these two algorithms.
I've used TNT/JAMA for linear least-squares estimation. It's not very sophisticated but is fairly quick + easy.
Lets talk first about factor analysis since most of the discussion above is about regression. Most of my experience is with software like SAS, Minitab, or SPSS, that solves the factor analysis equations, so I have limited experience in solving these directly. That said, that the most common implementations do not use linear regression to solve the equations. According to this, the most common methods used are principal component analysis and principal factor analysis. In a text on Applied Multivariate Analysis (Dallas Johnson), no less that seven methods are documented each with their own pros and cons. I would strongly recommend finding an implementation that gives you factor scores rather than programming a solution from scratch.
The reason why there's different methods is that you can choose exactly what you're trying to minimize. There a pretty comprehensive discussion of the breadth of methods here.