Finding values over (µ + 3 sigma) with Boost::accumulators

Here is my problem: I have a 2D matrix of doubles containing data. The data is Gaussian and I need to find out which data points are the extreme ones. As a first estimate, values > (µ + 3 sigma) should be okay. I just want to be sure that I'm doing the following correctly:
I can add the data to the accumulator and I'm able to calculate µ, but how can I get sigma?

You can get the mean and the second moment from the accumulator:
#include <iostream>
#include <boost/accumulators/accumulators.hpp>
#include <boost/accumulators/statistics/stats.hpp>
#include <boost/accumulators/statistics/mean.hpp>
#include <boost/accumulators/statistics/moment.hpp>
using namespace boost::accumulators;
int main()
{
    // Define an accumulator set for calculating the mean and the
    // 2nd moment ...
    accumulator_set<double, stats<tag::mean, tag::moment<2> > > acc;
    // push in some data ...
    acc(1.2);
    acc(2.3);
    acc(3.4);
    acc(4.5);
    // Display the results ...
    std::cout << "Mean: " << mean(acc) << std::endl;
    std::cout << "Moment: " << moment<2>(acc) << std::endl;
    return 0;
}
However, the Boost docs say that this is the raw moment (not the central one):
Calculates the N-th moment of the samples, which is defined as the sum
of the N-th power of the samples over the count of samples.
So you need to adjust it: sigma is the square root of the second central moment µ_2, and µ_2 equals the raw second moment minus the square of the mean, i.e. sigma = sqrt(moment<2> - mean^2). See
http://en.wikipedia.org/wiki/Moment_(mathematics)
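As a minimal sketch of that adjustment, the following uses only the standard library rather than Boost. The helper names sigma_from_raw_moment and indices_above are hypothetical, and the population (not sample) standard deviation is assumed:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: population standard deviation from the mean and the
// raw (non-central) second moment, sigma = sqrt(E[x^2] - mean^2).
double sigma_from_raw_moment(double mean, double raw_moment2) {
    double variance = raw_moment2 - mean * mean;
    // Rounding can push the difference slightly below zero; clamp it.
    return variance > 0.0 ? std::sqrt(variance) : 0.0;
}

// Hypothetical helper: indices of all values greater than mean + k * sigma.
std::vector<std::size_t> indices_above(const std::vector<double>& data, double k) {
    std::vector<std::size_t> result;
    if (data.empty()) return result;
    double sum = 0.0, sum_sq = 0.0;
    for (double x : data) {
        sum += x;          // accumulates towards the mean
        sum_sq += x * x;   // accumulates towards the raw second moment
    }
    const double n = static_cast<double>(data.size());
    const double mean = sum / n;
    const double sigma = sigma_from_raw_moment(mean, sum_sq / n);
    for (std::size_t i = 0; i < data.size(); ++i)
        if (data[i] > mean + k * sigma) result.push_back(i);
    return result;
}
```

With the accumulator from the answer, the first helper would be called as sigma_from_raw_moment(mean(acc), moment<2>(acc)). Boost.Accumulators also offers tag::variance, which may save the manual step.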

Cgal - Point_set_3 object, functions fails to process?

I'm currently trying to build upon the surface reconstruction tutorial and noticed a potentially major issue which, in my experience, also occurs outside of the tutorial:
In the tutorial at https://doc.cgal.org/latest/Manual/tuto_reconstruction.html (CGAL 5.3), the author does some pre-processing before the mesh reconstruction, such as outlier removal (remove_outliers) and grid simplification (grid_simplify_point_set).
However, I noticed that no points are removed during these steps. I tried multiple parameters for both functions and still, every time, no points get removed.
When working with a vector of points instead of a Point_set_3 object, I do manage to get points removed with the same parameters.
Am I the only one who is unable to remove points with these functions on a Point_set_3 object?
If yes, can you show me an example of how to make it work?
If no, should I avoid using Point_set_3 objects? Or should I convert to a std::vector before doing the pre-processing steps? And if so, how?
Issue Details
The code runs fine. No errors.
Source Code
This code subset comes straight out of the tutorial.
https://doc.cgal.org/latest/Manual/tuto_reconstruction.html
#include <CGAL/Exact_predicates_inexact_constructions_kernel.h>
#include <CGAL/Point_set_3.h>
#include <CGAL/Point_set_3/IO.h>
#include <CGAL/remove_outliers.h>
#include <CGAL/grid_simplify_point_set.h>
#include <CGAL/jet_smooth_point_set.h>
#include <CGAL/jet_estimate_normals.h>
#include <CGAL/mst_orient_normals.h>
#include <CGAL/poisson_surface_reconstruction.h>
#include <CGAL/Advancing_front_surface_reconstruction.h>
#include <CGAL/Scale_space_surface_reconstruction_3.h>
#include <CGAL/Scale_space_reconstruction_3/Jet_smoother.h>
#include <CGAL/Scale_space_reconstruction_3/Advancing_front_mesher.h>
#include <CGAL/Surface_mesh.h>
#include <CGAL/Polygon_mesh_processing/polygon_soup_to_polygon_mesh.h>
#include <cstdlib>
#include <vector>
#include <fstream>
// types
typedef CGAL::Exact_predicates_inexact_constructions_kernel Kernel;
typedef Kernel::FT FT;
typedef Kernel::Point_3 Point_3;
typedef Kernel::Vector_3 Vector_3;
typedef Kernel::Sphere_3 Sphere_3;
typedef CGAL::Point_set_3<Point_3, Vector_3> Point_set;
int main(int argc, char*argv[])
{
Point_set points;
if (argc < 2)
{
std::cerr << "Usage: " << argv[0] << " [input.xyz/off/ply/las]" << std::endl;
return EXIT_FAILURE;
}
const char* input_file = argv[1];
std::ifstream stream (input_file, std::ios_base::binary);
if (!stream)
{
std::cerr << "Error: cannot read file " << input_file << std::endl;
return EXIT_FAILURE;
}
stream >> points;
std::cout << "Read " << points.size () << " point(s)" << std::endl;
if (points.empty())
return EXIT_FAILURE;
CGAL::remove_outliers<CGAL::Sequential_tag>
(points,
24, // Number of neighbors considered for evaluation
points.parameters().threshold_percent (5.0)); // Percentage of points to remove
std::cout << points.number_of_removed_points()
<< " point(s) are outliers." << std::endl;
// Applying point set processing algorithm to a CGAL::Point_set_3
// object does not erase the points from memory but place them in
// the garbage of the object: memory can be freed by the user.
points.collect_garbage();
// Compute average spacing using neighborhood of 6 points
double spacing = CGAL::compute_average_spacing<CGAL::Sequential_tag> (points, 6);
// Simplify using a grid of size 2 * average spacing
CGAL::grid_simplify_point_set (points, 2. * spacing);
std::cout << points.number_of_removed_points()
<< " point(s) removed after simplification." << std::endl;
points.collect_garbage();
CGAL::jet_smooth_point_set<CGAL::Sequential_tag> (points, 24);
unsigned int reconstruction_choice
= (argc < 3 ? 0 : atoi(argv[2]));
if (reconstruction_choice == 0) // Poisson
{
CGAL::jet_estimate_normals<CGAL::Sequential_tag>
(points, 24); // Use 24 neighbors
// Orientation of normals, returns iterator to first unoriented point
typename Point_set::iterator unoriented_points_begin =
CGAL::mst_orient_normals(points, 24); // Use 24 neighbors
points.remove (unoriented_points_begin, points.end());
}
return EXIT_SUCCESS;
}
Environment
I've replicated the issue in a Debian VM as well as in a Docker environment on macOS (also Debian-based).
The setup is pretty standard: I'm using the CMakeLists.txt already available in the tutorial_example.cpp folder and running:
# Creates files that will show the compiler how to behave
cmake -DCGAL_DIR=/app/cgal -DCMAKE_BUILD_TYPE=Release .
# Build the exe
make
I'm a self-taught Python programmer, so I'm quite new to C++.

Is there a limit on the number of values added to a boost::accumulator?

Is there a limit on how many values can be added to a boost::accumulator? If a large number of entries were added, is there any point at which the accumulator would cease to work properly, or is the internal algorithm robust enough to handle a set of values approaching infinity?
#include <iostream>
#include <boost/accumulators/accumulators.hpp>
#include <boost/accumulators/statistics/stats.hpp>
#include <boost/accumulators/statistics/mean.hpp>
#include <boost/accumulators/statistics/moment.hpp>
using namespace boost::accumulators;
int main()
{
// Define an accumulator set for calculating the mean and the
// 2nd moment ...
accumulator_set<double, stats<tag::mean, tag::moment<2> > > acc;
// push in some data ...
for (std::size_t i=0; i<VERY_LARGE_NUMBER; i++)
{
acc(i);
}
// Display the results ...
std::cout << "Mean: " << mean(acc) << std::endl;
std::cout << "Moment: " << moment<2>(acc) << std::endl;
return 0;
}

If your int is a 32 bit integer, you'll get a signed integer overflow at 46341 * 46341 when calculating moment<2> and your program therefore has undefined behavior.
To avoid that, cast i to the type you're using in the accumulator:
acc(static_cast<double>(i));
This will now have the same limits as a normal double. You can add as many elements as you like, as long as the internal moment calculations (x² for moment<2>, or the running sum) don't exceed the limit for a double (std::numeric_limits<double>::max()).
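The 46341 threshold mentioned above can be checked with a small sketch. The helper name square_fits_in_int32 is hypothetical; it forms the product in 64 bits so that the check itself cannot overflow:

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: true if i * i is representable as a 32-bit signed int.
// The multiplication is done in 64 bits, so it is well defined for every
// 32-bit input.
bool square_fits_in_int32(std::int32_t i) {
    const std::int64_t wide = static_cast<std::int64_t>(i) * i;
    return wide <= std::numeric_limits<std::int32_t>::max();
}
```

Here square_fits_in_int32(46340) is true (46340² = 2147395600), while square_fits_in_int32(46341) is false (46341² = 2147488281, which is above 2147483647).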

The accumulator statistics do not account for overflow, so you need to select the accumulator type carefully. It doesn't need to match the initial type of the objects you are adding—you can cast it when accumulating, then get the statistics and cast it back to the original type.
You can see it with this simple example:
#include <cstdint>
#include <iostream>
#include <boost/accumulators/accumulators.hpp>
#include <boost/accumulators/statistics.hpp>
using namespace boost::accumulators;
int main(void) {
accumulator_set<int8_t, features<tag::mean>> accInt8;
accumulator_set<double, features<tag::mean>> accDouble;
int8_t sum = 0; // range of int8_t: -128 to 127
for (int8_t i = 1; i <= 100; i++) {
sum += i; // this will overflow!
accInt8(i); // this will also overflow
accDouble((double)i);
}
std::cout << "sum from 1 to 100: " << (int)sum << " (overflow)\n";
std::cout << "mean(<int8_t>): " << extract::mean(accInt8) << " (overflow)\n";
std::cout << "mean(<double>): " << (int)extract::mean(accDouble) << "\n";
return 0;
}
I used int8_t, which has a very small range (-128 to 127), to demonstrate that getting the mean of the values 1 to 100 (50.5, printed as 50 after the cast to int) overflows if you use int8_t as the internal type for the accumulator_set.
The output is:
sum from 1 to 100: -70 (overflow)
mean(<int8_t>): -7 (overflow)
mean(<double>): 50

Align decimal places in output?

#include <iomanip>
#include <cmath>
#include <iostream>
using namespace std;
int main() {
//
//HERE IS THE ISSUE
//set precision to 3 decimals
cout<<fixed;
//printing the final pressure of the gas
cout <<setw(20)<<left<<setfill('.')<<"Equation #01"<<"Ideal Gas Law(Chemistry): "<<setw(5)<<setprecision(3)<<gaslawPressure<<" atm"
<<endl;
//printing the calculated distance
cout <<setw(20)<<left<<setfill('.')<<"Equation #02"<<"Distance Formula(Math): "<<setw(5)<<setprecision(3)<<pointDistance<<endl;
return 0;
}
Output given:
Equation #01........Ideal Gas Law(Chemistry): 1.641 atm
Equation #02........Distance Formula(Math): 30.017
Output desired:
Equation #01........Ideal Gas Law(Chemistry):  1.641 atm
Equation #02........Distance Formula(Math)  : 30.017
I also need to have the colons align as such.

You have to apply a suitable setw, together with left alignment, to each part of the line, based on your text:
1) First part
setw(20)<<left<<setfill('.')<<"Equation #01"
2) Second part; assume a width of about 30
setw(30)<<left<<setfill(' ')<<"Ideal Gas Law(Chemistry)"
3) To align colon :
setw(3)<<left<<setfill(' ')<<":"
4) value part
setw(5)<<std::left<<setprecision(3)<<gaslawPressure<<" atm"
#include <iomanip>
#include <cmath>
#include <iostream>
using namespace std;
int main() {
//
//HERE IS THE ISSUE
//set precision to 3 decimals
auto gaslawPressure = 1.641;
auto pointDistance = 30.017;
cout<<fixed;
//printing the final pressure of the gas
cout <<setw(20)<<left<<setfill('.')<<"Equation #01"<<setw(30)<<left<<setfill(' ')<<"Ideal Gas Law(Chemistry)"<<setw(3)<<left<<setfill(' ')<<":"<<setw(5)<<std::left<<setprecision(3)<<gaslawPressure<<" atm"<<endl;
//printing the calculated distance
cout <<std::left<<setw(20)<<left<<setfill('.')<<"Equation #02"<<setw(30)<<left<<setfill(' ')<<"Distance Formula(Math)"<<setw(3)<<left<<setfill(' ')<<":"<<setw(5)<<setprecision(3)<<pointDistance<<endl;
return 0;
}
output
Equation #01........Ideal Gas Law(Chemistry)      :  1.641 atm
Equation #02........Distance Formula(Math)        :  30.017
Program ended with exit code: 0

UPDATE:
As I saw you didn't want just the second field to align. But if you are hard wiring the fields, you can format those yourself. If passed to you as strings, they can be handled with the same method as your doubles.
As you want to align the decimal points on the results, you have to do that yourself from what I understand. A helper structure keeps it out of the way and reusable.
#include <iomanip>
#include <cmath>
#include <iostream>
struct buf
{
double val;
buf(double val) :val(val) {}
friend std::ostream& operator<< (std::ostream& os, buf b) {
for (double i = b.val; i < 1000; i*=10) os << " ";
return os << b.val;
}
};
int main() {
//
double gaslawPressure = 1.615;
double pointDistance = 221.615;
std::cout << std::setw(20) << std::left << std::setfill('.')
<< "Equation #01" << "Ideal Gas Law(Chemistry) : " << buf(gaslawPressure)<<" atm" << std::endl;
//printing the calculated distance
std::cout << std::setw(20) << std::left << std::setfill('.')
<< "Equation #02" << "Distance Formula(Math) : "<< buf(pointDistance)<< std::endl;
return 0;
}
Output:
Equation #01........Ideal Gas Law(Chemistry) : 1.615 atm
Equation #02........Distance Formula(Math) : 221.615

As far as I know there is no quick way to do that using iostream/iomanip.
Without std::fixed, precision sets the total number of significant digits, not the length of the fractional part.
As I understand it, you need to pad the values correctly.
In that case, one solution is sprintf from <cstdio>.
It should look something like this:
sprintf(YourBuffer, "%10.3f", YourVariable);
https://en.cppreference.com/w/cpp/io/c/fprintf
http://www.cplusplus.com/reference/cstdio/printf/ - short version
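A minimal sketch of that idea, using std::snprintf (the bounds-checked sibling of sprintf) and returning a std::string. The helper name format_fixed3 and the default width of 10 are assumptions made for illustration:

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats `value` right-aligned in a field of `width`
// characters with exactly three digits after the decimal point, so that the
// decimal points line up when several values are printed below each other.
std::string format_fixed3(double value, int width = 10) {
    char buffer[64];
    // "%*.3f" takes the field width from the argument list.
    int n = std::snprintf(buffer, sizeof buffer, "%*.3f", width, value);
    if (n < 0) return std::string();  // encoding error
    if (n >= static_cast<int>(sizeof buffer))
        n = static_cast<int>(sizeof buffer) - 1;  // output was truncated
    return std::string(buffer, static_cast<std::size_t>(n));
}
```

For example, format_fixed3(1.641) yields "     1.641" and format_fixed3(30.017) yields "    30.017"; both strings are 10 characters wide, so the decimal points sit in the same column.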

computing lpNorm column wise in Eigen

When I try to call lpNorm<1> with colwise() in Eigen I get the error:
error: 'Eigen::DenseBase > >::ColwiseReturnType' has no member named 'lpNorm'
Instead norm() and squaredNorm() work fine calling them colwise.
example
#include <Eigen/Dense>
#include <iostream>
using namespace std;
using namespace Eigen;
int main()
{
MatrixXf m(2,2), n(2,2);
m << 1,-2,
-3,4;
cout << "m.colwise().squaredNorm() = " << m.colwise().squaredNorm() << endl;
cout << "m.lpNorm<1>() = " << m.lpNorm<1>() << endl;
// cout << "m.colwise().lpNorm<1>() = " << m.colwise().lpNorm<1>() << endl;
}
works fine giving
m.colwise().squaredNorm() = 10 20
m.lpNorm<1>() = 10
If I uncomment the last line I get the error.
Can someone help?

It is not implemented for colwise in Eigen <=3.2.9. You have two options:
Upgrade to Eigen 3.3 (beta)
Loop over all columns and calculate the lp norms one by one.

You can work around it like this:
m.cwiseAbs().colwise().sum()
Unfortunately this only works for the L1 norm (which is the sum of the absolute values).

Combinations of N Boost interval_set

I have a service which has outages in 4 different locations. I am modeling each location's outages as a Boost ICL interval_set. I want to know when at least N locations have an active outage.
Therefore, following this answer, I have implemented a combination algorithm, so I can create combinations between elements via interval_set intersections.
When this process is over, I should have a certain number of interval_sets, each one of them defining the outages for N locations simultaneously, and the final step will be joining them to get the desired full picture.
The problem is that I'm currently debugging the code, and when it comes to printing each intersection, the output text goes crazy (even when I'm using gdb to debug step by step) and I can't read it, while the CPU usage gets very high.
I guess that somehow I'm sending a larger portion of memory to the output than I should, but I can't see where the problem is.
This is a SSCCE:
#include <boost/icl/interval_set.hpp>
#include <algorithm>
#include <iostream>
#include <vector>
int main() {
// Initializing data for test
std::vector<boost::icl::interval_set<unsigned int> > outagesPerLocation;
for(unsigned int j=0; j<4; j++){
boost::icl::interval_set<unsigned int> outages;
for(unsigned int i=0; i<5; i++){
outages += boost::icl::discrete_interval<unsigned int>::closed(
(i*10), ((i*10) + 5 - j));
}
std::cout << "[Location " << (j+1) << "] " << outages << std::endl;
outagesPerLocation.push_back(outages);
}
// So now we have a vector of interval_sets, one per location. We will combine
// them so we get an interval_set defined for those periods where at least
// 2 locations have an outage (N)
unsigned int simultaneusOutagesRequired = 2; // (N)
// Create a bool vector in order to filter permutations, and only get
// the sorted permutations (which equals the combinations)
std::vector<bool> auxVector(outagesPerLocation.size());
std::fill(auxVector.begin() + simultaneusOutagesRequired, auxVector.end(), true);
// Create a vector where combinations will be stored
std::vector<boost::icl::interval_set<unsigned int> > combinations;
// Get all the combinations of N elements
unsigned int numCombinations = 0;
do{
bool firstElementSet = false;
for(unsigned int i=0; i<auxVector.size(); i++){
if(!auxVector[i]){
if(!firstElementSet){
// First location, insert to combinations vector
combinations.push_back(outagesPerLocation[i]);
firstElementSet = true;
}
else{
// Intersect with the other locations
combinations[numCombinations] -= outagesPerLocation[i];
}
}
}
numCombinations++;
std::cout << "[-INTERSEC-] " << combinations[numCombinations] << std::endl; // The problem appears here
}
while(std::next_permutation(auxVector.begin(), auxVector.end()));
// Get the union of the intersections and see the results
boost::icl::interval_set<unsigned int> finalOutages;
for(std::vector<boost::icl::interval_set<unsigned int> >::iterator
it = combinations.begin(); it != combinations.end(); it++){
finalOutages += *it;
}
std::cout << finalOutages << std::endl;
return 0;
}
Any help?

As I surmised, there's a "high-level" approach here.
Boost ICL containers are more than just containers of "glorified pairs of interval starting/end points". They are designed to implement just that business of combining, searching, in a generically optimized fashion.
So you don't have to.
If you let the library do what it's supposed to do:
using TimePoint = unsigned;
using DownTimes = boost::icl::interval_set<TimePoint>;
using Interval = DownTimes::interval_type;
using Records = std::vector<DownTimes>;
Using functional domain typedefs invites a higher level approach. Now, let's ask the hypothetical "business question":
What do we actually want to do with our records of per-location downtimes?
Well, we essentially want to
tally them for all discernable time slots and
filter those where tallies are at least 2
finally, we'd like to show the "merged" time slots that remain.
Ok, engineer: implement it!
Hmm. Tallying. How hard could it be?
❕ The key to elegant solutions is the choice of the right datastructure
using Tally = unsigned; // or: bit mask representing affected locations?
using DownMap = boost::icl::interval_map<TimePoint, Tally>;
Now it's just bulk insertion:
// We will do a tally of affected locations per time slot
DownMap tallied;
for (auto& location : records)
for (auto& incident : location)
tallied.add({incident, 1u});
Ok, let's filter. We just need a predicate that works on our DownMap, right?
// define threshold where at least 2 locations have an outage
auto exceeds_threshold = [](DownMap::value_type const& slot) {
return slot.second >= 2;
};
Merge the time slots!
Actually. We just create another DownTimes set, right. Just, not per location this time.
The choice of data structure wins the day again:
// just printing the union of any criticals:
DownTimes merged;
for (auto&& slot : tallied | filtered(exceeds_threshold) | map_keys)
merged.insert(slot);
Report!
std::cout << "Criticals: " << merged << "\n";
Note that nowhere did we come close to manipulating array indices, overlapping or non-overlapping intervals, closed or open boundaries. Or, [eeeeek!] brute force permutations of collection elements.
We just stated our goals, and let the library do the work.
Full Demo
Live On Coliru
#include <boost/icl/interval_set.hpp>
#include <boost/icl/interval_map.hpp>
#include <boost/range.hpp>
#include <boost/range/algorithm.hpp>
#include <boost/range/adaptors.hpp>
#include <boost/range/numeric.hpp>
#include <boost/range/irange.hpp>
#include <algorithm>
#include <iostream>
#include <vector>
using TimePoint = unsigned;
using DownTimes = boost::icl::interval_set<TimePoint>;
using Interval = DownTimes::interval_type;
using Records = std::vector<DownTimes>;
using Tally = unsigned; // or: bit mask representing affected locations?
using DownMap = boost::icl::interval_map<TimePoint, Tally>;
// Just for fun, removed the explicit loops from the generation too. Obviously,
// this is bit gratuitous :)
static DownTimes generate_downtime(int j) {
return boost::accumulate(
boost::irange(0, 5),
DownTimes{},
[j](DownTimes accum, int i) { return accum + Interval::closed((i*10), ((i*10) + 5 - j)); }
);
}
int main() {
// Initializing data for test
using namespace boost::adaptors;
auto const records = boost::copy_range<Records>(boost::irange(0,4) | transformed(generate_downtime));
for (auto location : records | indexed()) {
std::cout << "Location " << (location.index()+1) << " " << location.value() << std::endl;
}
// We will do a tally of affected locations per time slot
DownMap tallied;
for (auto& location : records)
for (auto& incident : location)
tallied.add({incident, 1u});
// We will combine them so we get an interval_set defined for those periods
// where at least 2 locations have an outage
auto exceeds_threshold = [](DownMap::value_type const& slot) {
return slot.second >= 2;
};
// just printing the union of any criticals:
DownTimes merged;
for (auto&& slot : tallied | filtered(exceeds_threshold) | map_keys)
merged.insert(slot);
std::cout << "Criticals: " << merged << "\n";
}
Which prints
Location 1 {[0,5][10,15][20,25][30,35][40,45]}
Location 2 {[0,4][10,14][20,24][30,34][40,44]}
Location 3 {[0,3][10,13][20,23][30,33][40,43]}
Location 4 {[0,2][10,12][20,22][30,32][40,42]}
Criticals: {[0,4][10,14][20,24][30,34][40,44]}

At the end of the permutation loop, you write:
numCombinations++;
std::cout << "[-INTERSEC-] " << combinations[numCombinations] << std::endl; // The problem appears here
My debugger tells me that on the first iteration numCombinations was 0 before the increment. But incrementing it made it out of range for the combinations container (since that is only a single element, so having index 0).
Did you mean to increment it after the use? Was there any particular reason not to use
std::cout << "[-INTERSEC-] " << combinations.back() << "\n";
or, for c++03
std::cout << "[-INTERSEC-] " << combinations[combinations.size()-1] << "\n";
or even just:
std::cout << "[-INTERSEC-] " << combinations.at(numCombinations) << "\n";
which would have thrown std::out_of_range?
On a side note, I think Boost ICL has vastly more efficient ways to get the answer you're after. Let me think about this for a moment. Will post another answer if I see it.
UPDATE: Posted the other answer showcasing high-level coding with Boost ICL
