Sort an Array using C++ in R - c++

I need to sort a data frame of prices row by row, in ascending order.
Doing it with a for loop in R is quite slow.
A friend of mine suggested using Rcpp,
but I'm having a hard time writing a loop in C++ that works.
#include <Rcpp.h>
// [[Rcpp::export]]
using namespace std;
List min(NumericVector x)
{
for (unsigned int i = 0; i < x.size(); i++) {
vector<int>& vec = x[i];
NumericVector Value sort(vec.begin(), vec.end());
}
Return Value;
}
I'm not used to C++, and I would like to know why the compiler keeps saying that my sort is wrong.
The goal is to sort my data frame by row.

Welcome (again) to StackOverflow and Rcpp! Two big worlds with much to discover...
sort() is available as a member function:
> Rcpp::cppFunction("NumericVector srt(NumericVector x) { return(x.sort()); }")
> srt(c(2,3,4,1.5,3.2))
[1] 1.5 2.0 3.0 3.2 4.0
>
Note that a more advanced question is hidden inside this simple one: the sort() member function sorts in place, so the above mutates its input. That can be convenient ("hey, no new heap object to return") or confusing, depending on your vantage point. We cover it in most Rcpp tutorials, but you may have other, more pressing issues. Keep at it!
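To make the in-place point concrete outside of R, here is a minimal sketch in plain C++ (not Rcpp) that sorts every row of a table while leaving the caller's data alone. The names Row and sort_rows are made up for illustration; in Rcpp itself, Rcpp::clone() gives you the deep copy to sort instead.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for one row of prices.
typedef std::vector<double> Row;

// Takes the table by value, so the caller's rows are copied and never mutated.
// Each row of the copy is then sorted in ascending order, in place.
std::vector<Row> sort_rows(std::vector<Row> table) {
    for (Row& row : table) {
        std::sort(row.begin(), row.end());
    }
    return table;  // moved out, not copied again
}
```

Passing by value is a deliberate choice here: the copy happens once at the call site, which is the price for not mutating the input.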

Related

How to use algorithms to fill vector of vectors

I have
typedef std::vector<int> IVec;
typedef std::vector<IVec> IMat;
and I would like to know how I can fill an IMat by using std algorithms, i.e. how to do the following with less code (all the IVecs have the same size)?
void fill(IMat& mat){
for (int i=0;i<mat.size();i++){
for (int j=0;j<mat[i].size();j++){
mat[i][j] = i*j;
}
}
}
PS: Even just a way to fill the matrix with a constant would already help me. Preferably with pre-C++11 algorithms.
The best solution is the one that you have already implemented. It takes advantage of using i/j as both offsets and as inputs to compute the algorithm.
Standard algorithms will have to use iterators for the elements and maintain counters. This data mirroring is a sure sign of a problem. But it can be done, even on one line if you want to be fancy:
for_each(mat.begin(), mat.end(), [&](auto& i) { static auto row = 0; auto column = 0; generate(i.begin(), i.end(), [&]() { return row * column++; }); ++row; });
But as stated, just because it can be done doesn't mean that it should be done. The best way to approach this is the for-loop. Even doing it on one line is possible if that's your thing:
for(auto i = 0U;i < mat.size();i++) for(auto j = 0U;j < mat[i].size();j++) mat[i][j] = i*j;
Incidentally, my standard-algorithm version works fine on Clang 3.7.0, gcc 5.1, and Visual Studio 2015. Previously, however, I used transform rather than generate, and there seem to be some implementation bugs in gcc 5.1 and Visual Studio 2015 with the captures of lambda-scope static variables.
I don't know if this is better than a double for loop, but one possible way you could do it using STL in C++11 would be using two for_each as follows:
int i(0);
std::for_each(mat.begin(), mat.end(),
[&i](IVec &ivec){int j(0); std::for_each(ivec.begin(), ivec.end(),
[&i,&j](int &k){k = i*j++;}); ++i;});
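Since the question asks for pre-C++11 algorithms and mentions that a constant fill would already help, here is a minimal sketch that avoids lambdas altogether. The functor names FillRowWith and Multiples and the two wrapper functions are invented for this example; only std::for_each, std::fill and std::generate come from the standard library.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

typedef std::vector<int> IVec;
typedef std::vector<IVec> IMat;

// Functor that fills a single row with one constant value.
struct FillRowWith {
    int value;
    explicit FillRowWith(int v) : value(v) {}
    void operator()(IVec& row) const {
        std::fill(row.begin(), row.end(), value);
    }
};

// Constant fill: apply the row functor to every row.
void fill_constant(IMat& mat, int value) {
    std::for_each(mat.begin(), mat.end(), FillRowWith(value));
}

// Generator that yields 0, step, 2*step, ... on successive calls.
struct Multiples {
    int step;
    int next;
    explicit Multiples(int s) : step(s), next(0) {}
    int operator()() {
        int current = next;
        next += step;
        return current;
    }
};

// mat[i][j] = i*j: row i receives the multiples of i.
// A fresh generator per row means no state leaks between rows or calls.
void fill_products(IMat& mat) {
    for (std::size_t i = 0; i < mat.size(); ++i) {
        std::generate(mat[i].begin(), mat[i].end(),
                      Multiples(static_cast<int>(i)));
    }
}
```

The outer loop in fill_products is kept as a plain for-loop on purpose, because the row index is needed as an input, which is the point the first answer makes.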
Just thought I'd comment further on Jonathan's excellent answer.
Ignore the C++11 syntax for now and imagine that we had written some supporting classes (it doesn't matter how for now).
We could conceivably come up with code like this:
auto main() -> int
{
// define a matrix (vector of vectors)
IMat mat;
// resize it through some previously defined function
resize(mat, 10, 10);
// get an object that is a pseudo-container representing its extent
auto extent = extent_of(mat);
// visit every cell of the pseudo-container, which forwards writes to the matrix
std::for_each(extent.begin(),
extent.end(),
[](auto pxy) { pxy.set_value(pxy.x * pxy.y); });
// or even
for (auto pxy : extent_of(mat)) {
pxy.set_value(product(pxy.coordinates()));
}
return 0;
}
100 lines of supporting code later (iterable containers and their proxies are not trivial) and this would compile and work.
Clever as it undoubtedly would be, there are some problems:
There's the small matter of the 100 extra lines of code.
It seems to me that this code is actually less expressive than yours, i.e. it's immediately obvious what your code is doing. With mine you have to make some assumptions or go and reason about the extra 100 lines of code.
My code needs a lot more maintenance (and documentation) than yours.
Sometimes less is more.

How to calculate the standard deviation with iterators and lambda functions

I learned that the mean of data stored in a std::vector< std::vector<double> > data can be calculated the following way:
void calculate_mean(std::vector<std::vector<double>>::iterator dataBegin,
std::vector<std::vector<double>>::iterator dataEnd,
std::vector<double>& rowmeans) {
auto Mean = [](std::vector<double> const& vec) {
return std::accumulate(vec.begin(), vec.end(), 0.0) / vec.size(); };
std::transform(dataBegin, dataEnd, rowmeans.begin(), Mean);
}
I wrote a function which takes the begin and end iterators of the data vector to calculate the mean; the std::vector<double> is where I store the result.
My first question is how to handle the return value of a function when working with vectors. In this case I pass a reference (an alias) and thereby modify the vector I initialized before calling the function, so nothing is copied back, which is nice. Is this good programming practice?
Second, my main question is how to adapt this function to calculate the standard deviation of each row in a similar way. I tried really hard, but it only produces a huge mess where nothing works properly. If someone sees right away how to do that, I would be glad for some insight. Thank you.
Edit: Solution
Here is my solution to the problem. Given a std::vector< std::vector<double> > data(rows, std::vector<double>(columns)), where the data is stored in the rows, the following function calculates the sample standard deviation of each row.
// Caller-side setup: one output slot per row.
auto begin = data.begin();
auto end = data.end();
std::vector<double> rowstds(data.size());
void calculate_std(std::vector<std::vector<double>>::iterator dataBegin,
std::vector<std::vector<double>>::iterator dataEnd,
std::vector<double>& rowstds){
auto test = [](std::vector<double> const& vec) {
double sum = std::accumulate(vec.begin(), vec.end(), 0.0);
double mean = sum / vec.size();
double stdSum = 0.0;
auto Std = [&](const double x) { stdSum += (x - mean) * (x - mean); };
std::for_each(vec.begin(), vec.end(), Std);
return std::sqrt(stdSum / (vec.size() - 1)); // sample std; needs vec.size() >= 2
};
std::transform(dataBegin, dataEnd, rowstds.begin(), test);
}
I tested it and it works just fine. If anyone has suggestions for improvement, please let me know. Also, is this piece of code good performance-wise?
You will often find the convention of writing functions with input parameters first, followed by input/output parameters.
Output parameters (that you write to with the return values of your function) are often a pointer to the data, or a reference.
So your solution seems perfect, from that point of view.
Source:
Google's C++ coding conventions
I mean in this case I make an Alias and modify in this way the vector I initialized before calling this function, so there is no copying back which is nice. So is this good programming practice?
No, you should use a local vector<double> variable and return by value. Any compiler worth using would optimize away the copying/moving, and any conforming C++11 compiler is required to perform a move if for whatever reason it cannot elide the copy/move altogether.
Your code as written imposes additional requirements on the caller that are not obvious. For instance, rowmeans must contain enough elements to store the means, or undefined behavior results.
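The return-by-value advice above could look roughly like the following sketch. The function names sample_std and row_stds are hypothetical, and returning 0.0 for rows with fewer than two elements is an assumption made here so the division is always defined; the original code does not handle that case.

```cpp
#include <algorithm>
#include <cmath>
#include <iterator>
#include <numeric>
#include <vector>

// Sample standard deviation of one row.
// Assumption: rows with fewer than two values yield 0.0.
double sample_std(const std::vector<double>& row) {
    if (row.size() < 2) {
        return 0.0;
    }
    const double mean =
        std::accumulate(row.begin(), row.end(), 0.0) / row.size();
    double squared_sum = 0.0;
    for (double x : row) {
        squared_sum += (x - mean) * (x - mean);
    }
    return std::sqrt(squared_sum / (row.size() - 1));
}

// Builds the result locally and returns it by value.
// The caller no longer has to pre-size an output vector.
std::vector<double> row_stds(const std::vector<std::vector<double>>& data) {
    std::vector<double> result;
    result.reserve(data.size());
    std::transform(data.begin(), data.end(),
                   std::back_inserter(result), sample_std);
    return result;
}
```

With this shape, the size requirement on the output vector disappears from the interface, because the function owns the allocation.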

Assign attributes with C++ to an arbitrary R object?

I have started with Rcpp and I am working through Hadley's book / page here.
I guess these basics are more than enough for me; still, I feel that I have missed some aspect, or that this one might be less basic:
How can I assign attributes to an arbitrary R Object using C++?
E.g.:
// [[Rcpp::export]]
NumericVector attribs(CharacterVector x,NumericVector y) {
NumericVector out = y;
out.attr("my-attr") = x;
return out;
}
I understand I have to specify the type in C++, but still I wonder whether there's a way to assign an attribute to ANY R object that I pass...
I have seen that setattr in the data.table package works with C++, but it seems to work only with elements of class data.table. Is there any way other than writing an extra function for every R mode / class?
EDIT: The ultimate purpose is to speed up assigning attributes to each element of a list.
We had a discussion here previously, but it did not involve Rcpp so far (except for using it via other packages).
Maybe you want something like this? RObject is the generic class for all R objects. Note the use of clone so that you don't accidentally modify the object passed in.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
CharacterVector set_attr(CharacterVector x, RObject y) {
CharacterVector new_x = clone(x);
new_x.attr("my-attr") = y;
return new_x;
}
/*** R
x <- c("a", "b", "c")
set_attr(x, 1)
set_attr(x, "a")
attributes(x)
*/
Pardon my enthusiasm: it's simply amazing how Rcpp helps an absolute novice speed up code like that!
That's why I gave it a try, even though Hadley's answer perfectly covers the question. I tried to turn the input given here into a solution for the more specific case of adding attributes to a list of objects as fast as possible.
Even though my code is probably far from perfect, I was already able to outperform all functions suggested in the discussion, including data.table's setattr. I guess this is because I let C++ do not only the assignment but the looping as well.
Here's the example and benchmark:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
RObject fx(List x, CharacterVector y){
int n = x.size();
NumericVector new_el;
for(int i=0; i<n; i++) {
new_el = x[i];
new_el.attr("testkey") = y;
x[i] = new_el;
}
return(x);
}

CUDA: Sorting a vector< vector<int> > on the GPU

I have implemented my own comparator for STL's sort function, that helps sorting a std::vector< std::vector<int> > on the CPU.
The user gives as input a std::vector< std::vector<int> > and also a string such as 021. Given this string, the sorting is done first on the first column, then on the third column, and then on the second column. Example:
1 2 3
3 2 1
1 1 1
1 1 2
Let's say the string is 10
The output will be
1 1 1
1 1 2
1 2 3
3 2 1
My CPU implementation uses a class called Sorting, this class is implemented with the following two files:
Sorting.h
class Sorting{
private:
public:
Sorting();
~Sorting();
std::vector<std::vector<int>> applySort(std::vector<std::vector<int>>
data,const std::string& attr);
};
Sorting.cpp
Sorting::Sorting(){}
Sorting::~Sorting(){}
std::vector<std::vector<int>> Sorting::applySort(
std::vector<std::vector<int>> data, const std::string& attr){
std::sort(data.begin(), data.begin()+data.size(), Comparator(attr));
return data;
}
Comparator.h
class Comparator{
private:
std::string attr;
public:
Comparator(const std::string& attr) { this->attr = attr; }
bool operator()(const std::vector<int>& first, const std::vector<int>&
second){
size_t i;
for(i=0;i<attr.size();i++){
if(first[attr.at(i) - '0'] < second[attr.at(i) - '0']) return true;
else if(first[attr.at(i) - '0'] > second[attr.at(i)-'0'])
return false;
}
return false;
}
};
My implementation has been tested and it works properly. I'm interested in doing a similar CUDA implementation that would take advantage of the GPU's capabilities in order to produce the output much faster.
Initially I thought that, because my goal is a bit unusual, adapting an already known GPU sorting implementation would do the job. However, after looking at many implementations, like the one described here: http://blogs.nvidia.com/2012/09/how-tesla-k20-speeds-up-quicksort-a-familiar-comp-sci-code/, I realized that it would be difficult to achieve.
I'm not sure if this is the best course of action. I started searching for libraries and found Thrust. However, although Thrust allows you to define your own comparator, in a question I asked yesterday I learned that it's not possible to create a host_vector < host_vector<int> >.
And I guess transforming my vector of vectors to a single vector wouldn't help me that much because I have no idea how I would then have to implement my comparator class.
I would like to hear your opinion on this issue:
how should I approach this problem?
Is it possible to achieve it with Thrust?
Would doing it on the GPU give much better performance for my overall code? Note that the vector of vectors can be huge: millions of rows, but only a few (5-10) columns.
Would it be better to design my own sort or to change a sort function that is already available? Although this sounds like a good idea, in practice I feel it would take me a lot of effort to achieve. Using a simple comparator and a sort function from a library would be best for me; however, the limitations of Thrust don't allow me to do so.
Thank you in advance
I see that you are trying to implement a lexicographical sorting technique (I have done it, but with a single huge 1D vector). I have been there: I implemented a function which sorts the vectors, but it actually lags way behind the lexicographical sorting. I am not sure whether I can post all of the code here, so if you need any help, I would be glad to help.
PS: look into the implementation of lexicographical_sort.cu in the Thrust example code (I have tweaked that one as well, but it also lags behind).
The comparator function you might need in order to compare two distinct places in the 1D vector (which contains all the data) is listed below. By the way, this technique is way slower than the CPU, but who knows, you might come up with an idea to improve it or use it better than I do.
struct arbitrary_functor
{
template <typename Tuple> __host__ __device__
void operator()(Tuple t)
{
if(thrust::get<0>(t)>thrust::get<1>(t))
thrust::get<2>(t) = 1;
else
thrust::get<2>(t) = 0;
}
};
int checkLexo_1vec(const thrust::device_vector<char> & A, int bv1, int bv2, int N){
thrust::device_vector<char> temp(N);
thrust::device_vector<char> sum(N);
// temp[k] = 1 where A[bv2+k] > A[bv1+k]; both ranges are N elements long
thrust::for_each(thrust::make_zip_iterator(
thrust::make_tuple(A.begin()+bv2, A.begin()+bv1,temp.begin())),
thrust::make_zip_iterator(
thrust::make_tuple(A.begin()+(bv2+N),A.begin()+(bv1+N), temp.end())),
arbitrary_functor());
thrust::inclusive_scan(temp.begin(),temp.end(),sum.begin());
// a = first position where the block at bv2 is larger
int a = thrust::lower_bound(sum.begin(),sum.end(),1) - sum.begin();
// temp[k] = 1 where A[bv1+k] > A[bv2+k]
thrust::for_each(thrust::make_zip_iterator(
thrust::make_tuple(A.begin()+bv1, A.begin()+bv2, temp.begin())),
thrust::make_zip_iterator(
thrust::make_tuple(A.begin()+(bv1+N), A.begin()+(bv2+N),temp.end())),
arbitrary_functor());
thrust::inclusive_scan(temp.begin(),temp.end(),sum.begin());
// b = first position where the block at bv1 is larger
int b = thrust::lower_bound(sum.begin(),sum.end(),1) - sum.begin();
if(a<=b)
return 1;
else
return 0;
}
I have found a reasonable method which can finally beat the CPU (not in terms of time, but in terms of data elements).
My new method uses thrust::mismatch, and I am attaching the code for the function.
The good thing about this version is that the running time of this function is approximately 2 ms for data sizes ranging from N = 1000 up to N = 1000000. I am posting the function code; do let me know if you find other improvements which can reduce the overall running time.
template<typename Ivec>
int lexoMM(Ivec vec, int bv1, int bv2, int N){
typedef thrust::device_vector<int>::iterator Iterator;
thrust::pair<Iterator,Iterator> result;
Iterator last1 = vec.begin()+(bv1+N-1);
result = thrust::mismatch(vec.begin()+bv1, last1, vec.begin()+bv2);
if(result.first == last1){
// no mismatch inside the compared range: both are equal (right order)
return 1;
}
// compare the first differing values, not the iterator positions
int a = *result.first;
int b = *result.second;
if(a > b){
// wrong order
return 0;
}
else{
// right order
return 1;
}
}
PS: I feel like I really wasted my time implementing my own version of this same thing, but Thrust is awesome :)
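On the question of how a comparator could work once the vector of vectors is flattened into a single vector: one possible approach is to sort row indices rather than rows. The sketch below is CPU-only standard C++, not Thrust or CUDA; FlatRowLess and sorted_row_order are hypothetical names, the table is assumed to be row-major with cols > 0, and attr is assumed to contain only digits that are valid column indices.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Compares two rows of a flattened, row-major table by their row index.
struct FlatRowLess {
    const int* data;   // rows * cols values, row-major
    std::size_t cols;  // number of columns per row (assumed > 0)
    std::string attr;  // column priority, e.g. "10" = column 1, then column 0

    bool operator()(std::size_t a, std::size_t b) const {
        for (char c : attr) {
            const std::size_t col = static_cast<std::size_t>(c - '0');
            const int x = data[a * cols + col];
            const int y = data[b * cols + col];
            if (x != y) {
                return x < y;
            }
        }
        return false;  // equal on all listed columns
    }
};

// Returns the row indices in sorted order; the flat data itself is not moved.
std::vector<std::size_t> sorted_row_order(const std::vector<int>& flat,
                                          std::size_t cols,
                                          const std::string& attr) {
    std::vector<std::size_t> order(flat.size() / cols);
    std::iota(order.begin(), order.end(), std::size_t(0));
    std::stable_sort(order.begin(), order.end(),
                     FlatRowLess{flat.data(), cols, attr});
    return order;
}
```

For the four-row example in the question with the string 10, this yields the row order 2, 3, 0, 1, which corresponds to the expected output. Whether the same index-sorting idea pays off on a GPU is a separate question that this sketch does not answer.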

Is it possible to use boost accumulators with vectors?

I wanted to use Boost accumulators to calculate statistics of a variable that is a vector. Is there a simple way to do this? I think it's not possible to do the most naive thing:
using namespace boost::accumulators;
//stuff...
accumulator_set<vector<double>, stats<tag::mean> > acc;
vector<double> some_vector;
//stuff
some_vector = doStuff();
acc(some_vector);
Maybe this is obvious, but I tried anyway. :P
What I wanted was to have an accumulator that would calculate a vector which is the mean of the components of many vectors. Is there an easy way out?
EDIT:
I don't know if I was thoroughly clear. I don't want this:
for_each(vec.begin(), vec.end(),acc);
This would calculate the mean of the entries of a given vector. What I need is different. I have a function that will spit vectors:
vector<double> doSomething();
// this is a monte carlo simulation;
And I need to run this many times and calculate the vectorial mean of those vectors:
for(int i = 0; i < numberOfMCSteps; i++){
vec = doSomething();
acc(vec);
}
cout << mean(acc);
And I want mean(acc) to be a vector itself, whose entry [i] would be the means of the entries [i] of the accumulated vectors.
There's a hint about this in the Boost docs, but nothing explicit. And I'm a bit dumb. :P
I've looked into your question a bit, and it seems to me that Boost.Accumulators already provides support for std::vector. Here is what I could find in a section of the user's guide:
"Another example where the Numeric Operators Sub-Library is useful is when a type does not define the operator overloads required to use it for some statistical calculations. For instance, std::vector<> does not overload any arithmetic operators, yet it may be useful to use std::vector<> as a sample or variate type. The Numeric Operators Sub-Library defines the necessary operator overloads in the boost::numeric::operators namespace, which is brought into scope by the Accumulators Framework with a using directive."
Indeed, after verification, the file boost/accumulators/numeric/functional/vector.hpp does contain the necessary operators for the 'naive' solution to work.
I believe you should try:
Including either
boost/accumulators/numeric/functional/vector.hpp before any other accumulators header
boost/accumulators/numeric/functional.hpp while defining BOOST_NUMERIC_FUNCTIONAL_STD_VECTOR_SUPPORT
Bringing the operators into scope with a using namespace boost::numeric::operators;.
There's only one last detail left: execution will break at runtime because the initial accumulated value is default-constructed, and an assertion will fire when trying to add a vector of size n to an empty vector. For this, it seems you should initialize the accumulator as follows (where n is the number of elements in your vector):
// the extra parentheses keep this from being parsed as a function declaration
accumulator_set<std::vector<double>, stats<tag::mean> > acc((std::vector<double>(n)));
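I tried the following code; mean gives me a std::vector of size 2:
#include <vector>
// vector.hpp must come before any other accumulators header
#include <boost/accumulators/numeric/functional/vector.hpp>
#include <boost/accumulators/accumulators.hpp>
#include <boost/accumulators/statistics/stats.hpp>
#include <boost/accumulators/statistics/mean.hpp>
#include <boost/assign/list_of.hpp>
using namespace boost::accumulators;
int main()
{
accumulator_set<std::vector<double>, stats<tag::mean> > acc(std::vector<double>(2));
const std::vector<double> v1 = boost::assign::list_of(1.)(2.);
const std::vector<double> v2 = boost::assign::list_of(2.)(3.);
const std::vector<double> v3 = boost::assign::list_of(3.)(4.);
acc(v1);
acc(v2);
acc(v3);
const std::vector<double> &meanVector = mean(acc);
}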
I believe this is what you wanted?
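For comparison, the same element-wise mean can also be accumulated by hand without Boost. This is only a sketch of the idea, not the Boost.Accumulators API: the class name VectorMean is invented here, and like the accumulator above it has to be told the vector length n up front.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: running element-wise mean over fixed-length vectors.
class VectorMean {
public:
    // n is the length every sample must have.
    explicit VectorMean(std::size_t n) : sum_(n, 0.0), count_(0) {}

    // Adds one sample to the running sums.
    void operator()(const std::vector<double>& sample) {
        if (sample.size() != sum_.size()) {
            throw std::invalid_argument("sample size mismatch");
        }
        for (std::size_t i = 0; i < sum_.size(); ++i) {
            sum_[i] += sample[i];
        }
        ++count_;
    }

    // Entry [i] is the mean of entries [i] of all samples seen so far.
    // With no samples, a vector of zeros is returned.
    std::vector<double> mean() const {
        std::vector<double> result(sum_);
        if (count_ > 0) {
            for (double& value : result) {
                value /= static_cast<double>(count_);
            }
        }
        return result;
    }

private:
    std::vector<double> sum_;
    std::size_t count_;
};
```

In the Monte Carlo loop from the question, this would be used by constructing VectorMean acc(n) once, calling acc(doSomething()) in each step, and reading acc.mean() at the end.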
I don't have it set up to try right now, but if all boost::accumulators need is properly defined mathematical operators, then you might be able to get away with a different vector type: http://www.boost.org/doc/libs/1_37_0/libs/numeric/ublas/doc/vector.htm
And what about the documentation?
// The data for which we wish to calculate statistical properties:
std::vector< double > data( /* stuff */ );
// The accumulator set which will calculate the properties for us:
accumulator_set< double, features< tag::min, tag::mean > > acc;
// Use std::for_each to accumulate the statistical properties:
acc = std::for_each( data.begin(), data.end(), acc );