How to calculate the standard deviation with iterators and lambda functions - C++

After learning that the mean of data stored in a std::vector<std::vector<double>> data can be calculated the following way:
void calculate_mean(std::vector<std::vector<double>>::iterator dataBegin,
                    std::vector<std::vector<double>>::iterator dataEnd,
                    std::vector<double>& rowmeans) {
    auto Mean = [](std::vector<double> const& vec) {
        return std::accumulate(vec.begin(), vec.end(), 0.0) / vec.size();
    };
    std::transform(dataBegin, dataEnd, rowmeans.begin(), Mean);
}
I made a function which takes the begin and end iterators of the data vector to calculate the means, and the std::vector<double>& is where I store the result.
My first question is how to handle the return value of a function when working with vectors. In this case I pass a reference and modify the vector I initialized before calling this function, so there is no copying back, which is nice. Is this good programming practice?
Second, my main question is how to adapt this function so one can calculate the standard deviation of each row in a similar way. I tried really hard, but it only produced a huge mess where nothing worked properly. If someone sees right away how to do that, I would be glad for the insight. Thank you.
Edit: Solution
So here is my solution to the problem. Given a std::vector<std::vector<double>> data(rows, std::vector<double>(columns)), where the data is stored in the rows, the following function calculates the sample standard deviation of each row.
// caller-side setup:
auto begin = data.begin();
auto end = data.end();
std::vector<double> std;
std.resize(data.size());

void calculate_std(std::vector<std::vector<double>>::iterator dataBegin,
                   std::vector<std::vector<double>>::iterator dataEnd,
                   std::vector<double>& rowstds) {
    auto test = [](std::vector<double> const& vec) {
        double sum = std::accumulate(vec.begin(), vec.end(), 0.0);
        double mean = sum / vec.size();
        double stdSum = 0.0;
        auto Std = [&](const double x) { stdSum += (x - mean) * (x - mean); };
        std::for_each(vec.begin(), vec.end(), Std);
        return std::sqrt(stdSum / (vec.size() - 1)); // std::sqrt from <cmath>
    };
    std::transform(dataBegin, dataEnd, rowstds.begin(), test);
}
I tested it and it works just fine. If anyone has suggestions for improvement, please let me know. And is this piece of code good performance-wise?

You will relatively often find the convention of writing functions with input parameters first, followed by input/output parameters.
Output parameters (the ones you write your function's results to) are often a pointer to the data, or a reference.
So your solution seems perfect, from that point of view.
Source:
Google's C++ coding conventions

I mean in this case I make an Alias and modify in this way the vector I initialized before calling this function, so there is no copying back which is nice. So is this good programming practice?
No, you should use a local vector<double> variable and return by value. Any compiler worth using would optimize away the copying/moving, and any conforming C++11 compiler is required to perform a move if for whatever reason it cannot elide the copy/move altogether.
Your code as written imposes additional requirements on the caller that are not obvious. For instance, rowmeans must contain enough elements to store the means, or undefined behavior results.
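For illustration, a minimal sketch of a return-by-value version of the questioner's function might look like the following (the name calculate_means and the use of const_iterator are my own choices, not part of the question):
#include <algorithm>
#include <numeric>
#include <vector>

// Same computation as above, but the result lives in a local vector that is
// returned by value; copy elision / move semantics make this cheap.
std::vector<double> calculate_means(std::vector<std::vector<double>>::const_iterator dataBegin,
                                    std::vector<std::vector<double>>::const_iterator dataEnd)
{
    std::vector<double> rowmeans(std::distance(dataBegin, dataEnd));
    std::transform(dataBegin, dataEnd, rowmeans.begin(),
                   [](std::vector<double> const& vec) {
                       return std::accumulate(vec.begin(), vec.end(), 0.0) / vec.size();
                   });
    return rowmeans;
}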

Zero overhead subscript operator for a set of values

Assume we have a function with the following signature (the signature may not be changed, since this function is part of a legacy API):
void Foo(const std::string& s, float v0, float v1, float v2)
{ ... }
How can one access the last three arguments by index using the subscript operator [] without actually copying the data into some sort of container?
Regularly when I come across this kind of issue I put the values in a container, like const std::array<float,3> args{v0,v1,v2}; and access these values using args[0], which unfortunately needs to copy the values.
Another idea would be to access the arguments using a parameter pack, which in turn involves the creation of a templated function which seems to be overkill for this task.
I'm aware that the version using std::array<> might be suitable since the compiler will probably optimize this kind of thing; however, this question is academically motivated.
You can't. Not in a way that guarantees zero overhead, or overhead similar to that of array subscripting.
You could, of course, do something like float* vs[]{&v0, &v1, &v2};, and then dereference the result of vs[i]. For that matter, you could make a utility class to act as a transparent reference (to try to get around arrays of references being illegal), though the result is inevitably limited.
The ultimate problem, though, is that nothing in the standard guarantees (or even suggests) that function arguments be stored in any particular memory ordering. On most platforms, at least one of those floats is going to be in a register, meaning that there's just no way to natively subscript it.
If a group of objects does not start out as an array, it's not possible to treat them as an array.
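For illustration, the pointer-array workaround mentioned above might look like this (a sketch only; it gives subscript syntax over the named parameters, not zero overhead):
#include <iostream>
#include <string>

void Foo(const std::string& s, float v0, float v1, float v2)
{
    // Array of pointers to the named parameters; *vs[i] reads the i-th argument.
    // This adds an indirection, so it is not literally zero overhead.
    float* vs[]{ &v0, &v1, &v2 };
    for (int i = 0; i < 3; ++i)
        std::cout << *vs[i] << '\n';
}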
Another idea would be to access the arguments using a parameter pack, which in turn involves the creation of a templated function which seems to be overkill for this task.
Not necessarily. One thing you can do is use std::tie to build a std::tuple of references to the function parameters and then access that tuple via std::get. That should optimize out, but let you refer to the parameters as if they are part of a single collection. That would look like
#include <iostream>
#include <string>
#include <tuple>

void Foo(const std::string& s, float v0, float v1, float v2)
{
    auto args = std::tie(v0, v1, v2);   // tuple of references to the parameters
    std::cout << std::get<1>(args);     // prints v1
}
It's not using operator [], and requires your indices be compile time constants, but you can now pass them to something else as one object.
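If you want to visit all of them at once, a C++17 fold over the same tuple is one option (my own sketch, not part of the original answer; std::apply lives in <tuple>):
#include <iostream>
#include <string>
#include <tuple>

void Foo(const std::string& s, float v0, float v1, float v2)
{
    auto args = std::tie(v0, v1, v2);
    // Apply a fold expression over the tuple of references to visit every argument.
    std::apply([](auto&... a) { ((std::cout << a << ' '), ...); }, args);
}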
Danger, Will Robinson! Danger!
This is going to be horribly implementation-dependent, and an all-around bad idea! It relies on undefined behavior. It is less awful with fixed hardware and tools, but less as in "we're only going to eat 5 babies, not a full dozen".
Those three floats are on the stack next to each other. I don't know if there are any packing rules for the stack. I don't know which order they'll be in on the stack ("v0 v1 v2" vs "v2 v1 v0"). Hell, some optimized build might even put them in a different order just to optimize some oddball case that doesn't actually come up in real life. I dunno. But I suspect something like this will work.
#include <cstdio>
#include <string>

void Foo(const std::string& s, float v0, float v1, float v2)
{
    float* vp = &v2;
    for (int i = 0; i < 3; ++i)
    {
        printf("%f\n", vp[i]);
    }
}

int main()
{
    Foo("", 1.0f, 2.0f, 3.0f);
}
3.0000
2.0000
1.0000
So it is possible. It's also ugly, vile, evil, and probably both fattening and carcinogenic.
On GodBolt.org, using gcc x86-64 9.3, the above code worked fine. In VS2017 (Intel 64-bit), I had to use float* vp = &v0 and for (int i = 0; i < 5; i += 2). Different alignment, different order, and different output (1, 2, 3, not 3, 2, 1).
I'm pretty sure I just consigned my soul to the Nth circle of hell.

C++ functor advantage - holding the state [duplicate]

This question already has answers here: What are C++ functors and their uses? (14 answers)
Closed 8 years ago.
I studied the whole idea of functors; unfortunately, I can't understand the real advantage of functors over typical functions.
According to some academic scripts, functors can hold state, unlike functions.
Can anyone elaborate on this with a simple, easy-to-understand example?
I really can't understand why typical, regular functions are not able to do the same. I'm really sorry for this kind of novice question.
As a really trivial demonstration, let's consider a Quicksort. We choose a value (usually known as the "pivot") and separate the input collection into those that compare less than the pivot, and those that compare greater than or equal to the pivot.¹
The standard library already has std::partition that can do the partitioning itself--separate a collection into those items that satisfy a specified condition, and those that don't. So, to do our partitioning, we just have to supply a suitable predicate.
In this case, we need a simple comparison something like: return x < pivot;. Passing the pivot value every time becomes difficult though. std::partition just passes a value from the collection and asks: "does this pass your test or not?" There's no way for you to tell std::partition what the current pivot value is, and have it pass that to your routine when it's invoked. That could be done, of course (e.g., many enumeration functions in Windows work this way), but it gets pretty clumsy.
When we invoke std::partition we've already chosen the pivot value. What we want is a way to...bind that value to one of the parameters that will be passed to the comparison function. One really ugly way to do that would be to "pass" it via a global variable:
int pivot;

bool pred(int x) { return x < pivot; }

void quick_sort(int *begin, int *end) {
    if (end - begin < 2)
        return;
    pivot = choose_pivot(begin, end);
    int *pos = std::partition(begin, end, pred);
    quick_sort(begin, pos);
    quick_sort(pos, end);
}
I really hope I don't have to point out that we'd rather not use a global for this if we can help it. One fairly easy way to avoid it is to create a function object. We pass the current pivot value when we create the object, and it stores that value as state in the object:
class pred {
    int pivot;
public:
    pred(int pivot) : pivot(pivot) {}
    bool operator()(int x) { return x < pivot; }
};

void quick_sort(int *begin, int *end) {
    if (end - begin < 2)
        return;
    int pivot = choose_pivot(begin, end);
    int *pos = std::partition(begin, end, pred(pivot));
    quick_sort(begin, pos);
    quick_sort(pos, end);
}
This has added a tiny bit of extra code, but in exchange we've eliminated a global--a fairly reasonable exchange.
Of course, as of C++11 we can do quite a bit better still--the language added "lambda expressions" that can create a class pretty much like that for us. Using this, our code looks something like this:
void quick_sort(int *begin, int *end) {
    if (end - begin < 2)
        return;
    int pivot = find_pivot(begin, end);
    auto pos = std::partition(begin, end, [pivot](int x) { return x < pivot; });
    quick_sort(begin, pos);
    quick_sort(pos, end);
}
This changes the syntax we use to specify the class/create the function object, but it's still pretty much the same basic idea as the preceding code: the compiler generates a class with a constructor and an operator(). The values we enclose in the square brackets are passed to the constructor, and the (int x) { return x < pivot; } basically becomes the body of the operator() for that class.²
This makes code much easier to write and much easier to read--but it doesn't change the basic fact that we're creating an object, "capturing" some state in the constructor, and using an overloaded operator() for the comparison.
Of course, a comparison just happens to be what we need for things like sorting. It is a common use of lambda expressions and function objects more generally, but we're certainly not restricted to it. Just for another example, let's consider "normalizing" a collection of doubles. We want to find the largest one, then divide every value in the collection by that, so each item is in the range 0.0 to 1.0, but all retaining the same ratios to each other as they previously had:
double largest = *std::max_element(begin, end);
std::for_each(begin, end, [largest](double &d) { d /= largest; });
Here again we have pretty much the same pattern: create a function object that stores some relevant state, then repeatedly apply that function object's operator() to do the real work.
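Just to make that parallel explicit, a hand-written functor equivalent of the normalizing lambda might look like this (a sketch of my own; the names normalize and normalize_all are illustrative, and the lambda version above is what you would normally write):
#include <algorithm>
#include <vector>

// The captured state (largest) becomes a member, and the call operator
// does the division, just like the lambda does.
class normalize {
    double largest;
public:
    explicit normalize(double largest) : largest(largest) {}
    void operator()(double& d) const { d /= largest; }
};

void normalize_all(std::vector<double>& values) {
    double largest = *std::max_element(values.begin(), values.end());
    std::for_each(values.begin(), values.end(), normalize(largest));
}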
¹ We could separate into less than or equal to, and greater than, instead. Or we could create three groups: less than, equal to, greater than. The latter can improve efficiency in the presence of many duplicates, but for the moment we really don't care.
² There's a lot more to know about lambda expressions than just this--I'm simplifying some things, and completely ignoring others that we don't care about at the moment.

I'm passing a vector to a function as a double but Visual Studio is calling it an unsigned int

I'm very new to programming and C++. I'm trying to pass a vector to a function to sort it, and return its highest value. I seem to be able to get the right answer with the code I have, but it truncates the decimal places. Here's my code:
double findMaxValue(vector<double> v)
{
    sort(v.begin(), v.end());
    int lastIndex = v.size() - 1;
    double maxValue = v[lastIndex];
    return maxValue;
}
I found some similar questions on here, but they all seem to be far above my level of knowledge. Any help would be greatly appreciated.
The truncation happens outside of your function's code; it may be related to how the value is printed in the code that calls findMaxValue.
Here are a few notes on the implementation (a combined sketch follows after the list):
You can pass v by reference to avoid copying
You can make vector<double> a const because you do not modify it inside findMaxValue
You can use v.back() instead of v[lastIndex]. If you make v a const, you will need to use v.back() to avoid errors.
Unless this is a learning exercise that requires you to write your own findMaxValue, *std::max_element(v.begin(), v.end()) will do the same thing on non-empty vectors.
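Putting those notes together, one possible sketch of the function is shown below (the empty-vector check is my own addition, not something the notes above require):
#include <algorithm>
#include <stdexcept>
#include <vector>

// Takes the vector by const reference (no copy, no modification) and uses
// std::max_element instead of sorting the whole vector.
double findMaxValue(const std::vector<double>& v)
{
    if (v.empty())
        throw std::invalid_argument("findMaxValue: empty vector");
    return *std::max_element(v.begin(), v.end());
}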

Is it possible to use boost accumulators with vectors?

I wanted to use Boost accumulators to calculate statistics of a variable that is a vector. Is there a simple way to do this? I think it's not possible to do the dumbest thing:
using namespace boost::accumulators;
//stuff...
accumulator_set<vector<double>, stats<tag::mean> > acc;
vector<double> some_vector;
//stuff
some_vector = doStuff();
acc(some_vector);
maybe this is obvious, but I tried anyway. :P
What I wanted was to have an accumulator that would calculate a vector which is the mean of the components of many vectors. Is there an easy way out?
EDIT:
I don't know if I was thoroughly clear. I don't want this:
for_each(vec.begin(), vec.end(),acc);
This would calculate the mean of the entries of a given vector. What I need is different. I have a function that will spit out vectors:
vector<double> doSomething();
// this is a monte carlo simulation;
And I need to run this many times and calculate the vectorial mean of those vectors:
for (int i = 0; i < numberOfMCSteps; i++) {
    vec = doSomething();
    acc(vec);
}
cout << mean(acc);
And I want mean(acc) to be a vector itself, whose entry [i] would be the mean of the entries [i] of the accumulated vectors.
There's a hint about this in the Boost docs, but nothing explicit. And I'm a bit dumb. :P
I've looked into your question a bit, and it seems to me that Boost.Accumulators already provides support for std::vector. Here is what I could find in a section of the user's guide :
Another example where the Numeric Operators Sub-Library is useful is when a type does not define the operator overloads required to use it for some statistical calculations. For instance, std::vector<> does not overload any arithmetic operators, yet it may be useful to use std::vector<> as a sample or variate type. The Numeric Operators Sub-Library defines the necessary operator overloads in the boost::numeric::operators namespace, which is brought into scope by the Accumulators Framework with a using directive.
Indeed, after verification, the file boost/accumulators/numeric/functional/vector.hpp does contain the necessary operators for the 'naive' solution to work.
I believe you should try:
Including either boost/accumulators/numeric/functional/vector.hpp before any other accumulators header, or boost/accumulators/numeric/functional.hpp while defining BOOST_NUMERIC_FUNCTIONAL_STD_VECTOR_SUPPORT.
Bringing the operators into scope with a using namespace boost::numeric::operators;.
There's only one last detail left: execution will break at runtime because the initial accumulated value is default-constructed, and an assertion will occur when trying to add a vector of size n to an empty vector. For this, it seems you should initialize the accumulator with (where n is the number of elements in your vector):
accumulator_set<std::vector<double>, stats<tag::mean> > acc(std::vector<double>(n));
I tried the following code; mean gives me a std::vector of size 2:
#include <vector>
#include <boost/accumulators/numeric/functional/vector.hpp> // before the other accumulators headers
#include <boost/accumulators/accumulators.hpp>
#include <boost/accumulators/statistics/stats.hpp>
#include <boost/accumulators/statistics/mean.hpp>
#include <boost/assign/list_of.hpp>

using namespace boost::accumulators;

int main()
{
    accumulator_set<std::vector<double>, stats<tag::mean> > acc(std::vector<double>(2));
    const std::vector<double> v1 = boost::assign::list_of(1.)(2.);
    const std::vector<double> v2 = boost::assign::list_of(2.)(3.);
    const std::vector<double> v3 = boost::assign::list_of(3.)(4.);
    acc(v1);
    acc(v2);
    acc(v3);
    const std::vector<double> &meanVector = mean(acc);
}
I believe this is what you wanted ?
I don't have it set up to try right now, but if all boost::accumulators need is properly defined mathematical operators, then you might be able to get away with a different vector type: http://www.boost.org/doc/libs/1_37_0/libs/numeric/ublas/doc/vector.htm
And what about the documentation?
// The data for which we wish to calculate statistical properties:
std::vector< double > data( /* stuff */ );
// The accumulator set which will calculate the properties for us:
accumulator_set< double, features< tag::min, tag::mean > > acc;
// Use std::for_each to accumulate the statistical properties:
acc = std::for_each( data.begin(), data.end(), acc );
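If it helps, here is a self-contained version of that documentation snippet, including how the results can be read back out afterwards (the header paths and the (min)/mean extractors are the usual Boost.Accumulators ones; treat this as an untested sketch):
#include <algorithm>
#include <iostream>
#include <vector>
#include <boost/accumulators/accumulators.hpp>
#include <boost/accumulators/statistics/min.hpp>
#include <boost/accumulators/statistics/mean.hpp>

using namespace boost::accumulators;

int main()
{
    std::vector<double> data;
    data.push_back(1.0);
    data.push_back(2.0);
    data.push_back(3.0);
    data.push_back(4.0);

    accumulator_set<double, features<tag::min, tag::mean> > acc;
    // std::for_each returns the (copied) accumulator, which we assign back.
    acc = std::for_each(data.begin(), data.end(), acc);

    std::cout << "min:  " << (min)(acc) << '\n';   // parentheses guard against a min() macro
    std::cout << "mean: " << mean(acc)  << '\n';
}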

Initializing a C++ vector to random values... fast

Hey, I'd like to make this as fast as possible because it gets called A LOT in a program I'm writing, so is there any faster way to initialize a C++ vector to random values than:
double range; // set to the range of a particular function I want to evaluate
std::vector<double> x(30, 0.0);
for (int i = 0; i < x.size(); i++) {
    x.at(i) = (rand() / (double)RAND_MAX) * range;
}
EDIT: Fixed x's initializer.
Right now, this should be really fast since the loop won't execute.
Personally, I'd probably use something like this:
struct gen_rand {
    double range;
public:
    gen_rand(double r = 1.0) : range(r) {}
    double operator()() {
        return (rand() / (double)RAND_MAX) * range;
    }
};

std::vector<double> x(num_items);
std::generate_n(x.begin(), num_items, gen_rand());
Edit: It's purely a micro-optimization that might make no difference at all, but you might consider rearranging the computation to get something like:
struct gen_rand {
    double factor;
public:
    gen_rand(double r = 1.0) : factor(r / RAND_MAX) {}
    double operator()() {
        return rand() * factor;
    }
};
Of course, there's a really good chance the compiler will already do this (or something equivalent) but it won't hurt to try it anyway (though it's really only likely to help with optimization turned off).
Edit2: "sbi" (as is usually the case) is right: you might gain a bit by initially reserving space, then using an insert iterator to put the data into place:
std::vector<double> x;
x.reserve(num_items);
std::generate_n(std::back_inserter(x), num_items, gen_rand());
As before, we're into such microscopic optimization, I'm not at all sure I'd really expect to see a difference at all. In particular, since this is all done with templates, there's a pretty good chance most (if not all) the code will be generated inline. In that case, the optimizer is likely to notice that the initial data all gets overwritten, and skip initializing it.
In the end, however, nearly the only part that's really likely to make a significant difference is getting rid of the .at(i). The others might, but with optimizations turned on, I wouldn't really expect them to.
I have been using Jerry Coffin's functor method for some time, but with the arrival of C++11, we have loads of cool new random number functionality. To fill an array with random float values we can now do something like the following...
#include <algorithm>  // std::generate_n
#include <functional> // std::bind
#include <random>
#include <vector>

const size_t elements = 300;
std::vector<float> y(elements);
std::uniform_real_distribution<float> distribution(0.0f, 2.0f); // values between 0 and 2
std::mt19937 engine; // Mersenne twister MT19937
auto generator = std::bind(distribution, engine);
std::generate_n(y.begin(), elements, generator);
See the relevant section of Wikipedia for more engines and distributions
Yes, whereas x.at(i) does bounds checking, x[i] does not do so. Also, your code is incorrect as you have failed to specify the size of x in advance. You need to use std::vector<double> x(n), where n is the number of elements that you want to use; otherwise, your loop there will never execute.
Alternatively, you may want to make a custom iterator for generating random values and filling it using the iterator; because the std::vector constructor will initialize its elements, anyway, so if you have a custom iterator class that generates random values you may be able to eliminate a pass over the items.
In terms of implementing an iterator of your own, here is my untested code:
class random_iterator
{
public:
    typedef std::input_iterator_tag iterator_category;
    typedef double                  value_type;
    typedef int                     difference_type;
    typedef double*                 pointer;
    typedef double&                 reference;

    random_iterator() : _range(1.0), _count(0) {}
    random_iterator(double range, int count) :
        _range(range), _count(count) {}
    random_iterator(const random_iterator& o) :
        _range(o._range), _count(o._count) {}
    ~random_iterator() {}

    double operator*() const { return ((rand() / (double)RAND_MAX) * _range); }
    int operator-(const random_iterator& o) const { return o._count - _count; }
    random_iterator& operator++() { _count--; return *this; }
    random_iterator operator++(int) { random_iterator cpy(*this); _count--; return cpy; }
    bool operator==(const random_iterator& o) const { return _count == o._count; }
    bool operator!=(const random_iterator& o) const { return _count != o._count; }

private:
    double _range;
    int    _count;
};
With the code above, it should be possible to use:
std::vector<double> x(random_iterator(range,number),random_iterator());
That said, the generate code for the other solution given is simpler, and frankly, I would just explicitly fill the vector without resorting to anything fancy like this.... but it is kind of cool to think about.
#include <iostream>
#include <vector>
#include <algorithm>
#include <iterator> // std::ostream_iterator, std::back_inserter
#include <cstdlib>  // rand, RAND_MAX

struct functor {
    functor(double v) : val(v) {}
    double operator()() const {
        return (rand() / (double)RAND_MAX) * val;
    }
private:
    double val;
};

int main(int argc, const char** argv) {
    const int size = 10;
    const double range = 3.0;
    std::vector<double> dvec;
    std::generate_n(std::back_inserter(dvec), size, functor(range));

    // print all
    std::copy(dvec.begin(), dvec.end(), std::ostream_iterator<double>(std::cout, "\n"));
    return 0;
}
Too late :(
You may consider using a pseudo-random number generator that gives output as a sequence. Since most PRNGs just provide a sequence anyways, that will be a lot more efficient than simply calling rand() over and over again.
But then, I think I really need to know more about your situation.
Why does this piece of code execute so often? Can you restructure your code to avoid re-generating random data so frequently?
How big are your vectors?
How "good" does your random number generator need to be? High-quality distributions tend to be more expensive to calculate.
If your vectors are large, are you reusing their buffer space, or are you throwing it away and reallocating it elsewhere? Creating new vectors willy-nilly is a great way to destroy your cache.
@Jerry Coffin's answer looks very good. Two other thoughts, though:
Inlining - All of your vector access will be very fast, but if the call to rand() is out-of-line, the function call overhead might dominate. If that's the case, you may need to roll your own pseudorandom number generator.
SIMD - If you're going to roll your own PRNG, you might as well make it compute 2 doubles (or 4 floats) at once. This will reduce the number of the int-to-float conversions as well as the multiplications. I've never tried it, but apparently there's a SIMD version of the Mersenne Twister that's quite good. A simple linear congruential generator might be good enough too (and that's probably what rand() is using already).
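For what it's worth, a hand-rolled generator along those lines might look like the following sketch (a classic "minstd" linear congruential generator wrapped in a functor; this is textbook material, not a claim about what rand() actually uses):
#include <algorithm>
#include <cstdint>
#include <vector>

// Minimal "minstd" LCG returning doubles in [0, range). Much lower quality
// than std::mt19937, but cheap and easy to inline.
struct lcg_rand {
    std::uint32_t state;
    double range;
    lcg_rand(std::uint32_t seed, double r) : state(seed), range(r) {}
    double operator()() {
        state = static_cast<std::uint32_t>(
            (static_cast<std::uint64_t>(state) * 48271u) % 2147483647u);
        return range * state / 2147483647.0;
    }
};

// Usage sketch (the seed and range are placeholders):
// std::vector<double> x(30);
// std::generate(x.begin(), x.end(), lcg_rand(12345u, range));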
int main() {
    int size = 10;
    srand(time(NULL));

    std::vector<int> vec(size);
    std::generate(vec.begin(), vec.end(), rand);

    std::vector<int> vec_2(size);
    std::generate(vec_2.begin(), vec_2.end(), []() { return rand() % 50; });
}
You need to include <vector>, <algorithm>, <ctime>, and <cstdlib>.
The way I think about these is a rubber-meets-the-road approach.
In other words, there are certain minimal things that have to happen, no getting around it, such as:
the rand() function has to be called N times.
the result of rand() has to be converted to double and then multiplied by something.
the resulting numbers have to get stored in consecutive elements of an array.
The object is, at a minimum, to get those things done.
Other concerns, like whether or not to use a std::vector and iterators, are fine as long as they don't add any extra cycles.
The easiest way to see if they add significant extra cycles is to single-step the code at the assembly language level.