Duplicating dplyr::group_by's indices functionality with Rcpp - c++

As an exercise, I'm trying to use Rcpp and C++ to get grouping indices, much like what dplyr::group_by provides. These are the row numbers (starting from 0) corresponding to each group in the data.
Here's an example of what the indices would look like.
x <- sample(1:3, 10, TRUE)
x
# [1] 3 3 3 1 3 1 3 2 3 2
df <- data.frame(x)
attr(dplyr::group_by(df, x), "indices")
#[[1]]
#[1] 3 5
#
#[[2]]
#[1] 7 9
#
#[[3]]
#[1] 0 1 2 4 6 8
So far, using the standard library's std::unordered_multimap, I've come up with the following:
// [[Rcpp::plugins(cpp11)]]
#include <Rcpp.h>
using namespace Rcpp;
typedef std::vector<int> rowvec;
// [[Rcpp::export]]
std::vector<rowvec> rowlist(std::vector<int> x)
{
std::unordered_multimap<int, int> rowmap;
for (size_t i = 0; i < x.size(); i++)
{
rowmap.insert({ x[i], i });
}
std::vector<rowvec> rowlst;
for (size_t i = 0; i < rowmap.bucket_count(); i++)
{
if (rowmap.begin(i) != rowmap.end(i))
{
rowvec v(rowmap.count(i));
int b = 0;
for (auto it = rowmap.begin(i); it != rowmap.end(i); ++it, b++)
{
v[b] = it->second;
}
rowlst.push_back(v);
}
}
return rowlst;
}
Running this on a single variable results in
rowlist(x)
#[[1]]
#[1] 5 3
#
#[[2]]
#[1] 9 7
#
#[[3]]
#[1] 8 6 4 2 1 0
Other than the reversed ordering, this looks good. However, I can't figure out how to extend this to handle:
Different data types; the type is currently hardcoded into the function
More than one grouping variable
(std::unordered_multimap is also pretty slow compared to what group_by does, but I'll deal with that later.) Any help would be appreciated.

I've mulled over this question for quite some time, and my conclusion is that this is going to be quite difficult, to say the least. In order to replicate the magic of dplyr::group_by, you are going to have to write several classes and set up a really slick hashing function to deal with various data types and a differing number of columns. I've scoured the dplyr source code, and it looks like if you follow the creation of ChunkMapIndex, you will get a better understanding.
Speaking of data types, I'm not even sure that std::unordered_multimap can get you what you want, since using double/float types as keys is both unwise and difficult.
Given all of the challenges mentioned, the code below produces the same output as attr(dplyr::group_by(df, x), "indices") for integer types. I've set it up to get you started thinking about how to deal with different data types. It uses a templated approach with a helper function, as that is an easy and effective way of dealing with different data types. The helper function is very similar to the functions in the links Dirk provided.
// [[Rcpp::plugins(cpp11)]]
#include <Rcpp.h>
#include <string>
using namespace Rcpp;
typedef std::vector<int> rowvec;
typedef std::vector<rowvec> rowvec2d;
template <typename T>
rowvec2d rowlist(std::vector<T> x) {
std::unordered_multimap<T, int> rowmap;
for (int i = 0; i < x.size(); i++)
rowmap.insert({ x[i], i });
rowvec2d rowlst;
for (int i = 0; i < rowmap.bucket_count(); i++) {
if (rowmap.begin(i) != rowmap.end(i)) {
rowvec v(rowmap.count(i));
int b = 0;
for (auto it = rowmap.begin(i); it != rowmap.end(i); ++it, b++)
v[b] = it->second;
rowlst.push_back(v);
}
}
return rowlst;
}
template <typename T>
rowvec2d tempList(rowvec2d myList, std::vector<T> v) {
rowvec2d vecOut;
if (myList.size() > 0) {
for (std::size_t i = 0; i < myList.size(); i++) {
std::vector<T> vecPass(myList[i].size());
for (std::size_t j = 0; j < myList[i].size(); j++)
vecPass[j] = v[myList[i][j]];
rowvec2d vecTemp = rowlist(vecPass);
for (std::size_t j = 0; j < vecTemp.size(); j++) {
rowvec myIndex(vecTemp[j].size());
for (std::size_t k = 0; k < vecTemp[j].size(); k++)
myIndex[k] = myList[i][vecTemp[j][k]];
vecOut.push_back(myIndex);
}
}
} else {
vecOut = rowlist(v);
}
return vecOut;
}
// [[Rcpp::export]]
rowvec2d rowlistMaster(DataFrame myDF) {
DataFrame::iterator itDF;
rowvec2d result;
for (itDF = myDF.begin(); itDF != myDF.end(); itDF++) {
switch(TYPEOF(*itDF)) {
case INTSXP: {
result = tempList(result, as<std::vector<int> >(*itDF));
break;
}
default: {
stop("v must be of type integer");
}
}
}
return result;
}
It works with more than one grouping variable; however, it is not nearly as fast as dplyr.
set.seed(101)
x <- sample(1:5, 10^4, TRUE)
y <- sample(1:5, 10^4, TRUE)
w <- sample(1:5, 10^4, TRUE)
z <- sample(1:5, 10^4, TRUE)
df <- data.frame(x,y,w,z)
identical(attr(dplyr::group_by(df, x, y, w, z), "indices"), rowlistMaster(df))
[1] TRUE
library(microbenchmark)
microbenchmark(dplyr = attr(dplyr::group_by(df, x, y, w, z), "indices"),
challenge = rowlistMaster(df))
Unit: milliseconds
expr min lq mean median uq max neval
dplyr 2.693624 2.900009 3.324274 3.192952 3.535927 6.827423 100
challenge 53.905133 70.091335 123.131484 141.414806 149.923166 190.010468 100
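To go further with the "different data types" part, the switch in rowlistMaster can simply grow extra cases that reuse the same templates. The sketch below (untested against dplyr, and using the illustrative name rowlistMaster2) adds character columns, which are easy because std::string already works as an unordered_multimap key; double columns would still need the care mentioned above.
// Sketch: same as rowlistMaster() above, but also accepting character columns.
// Reuses the rowlist()/tempList() templates defined earlier.
// [[Rcpp::export]]
rowvec2d rowlistMaster2(DataFrame myDF) {
    DataFrame::iterator itDF;
    rowvec2d result;
    for (itDF = myDF.begin(); itDF != myDF.end(); itDF++) {
        switch(TYPEOF(*itDF)) {
            case INTSXP: {
                result = tempList(result, as<std::vector<int> >(*itDF));
                break;
            }
            case STRSXP: {
                result = tempList(result, as<std::vector<std::string> >(*itDF));
                break;
            }
            default: {
                stop("only integer and character columns are handled here");
            }
        }
    }
    return result;
}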

Related

Competitive programming problem I am stuck in

So there is a list of points (x1,y1,z1), (x2,y2,z2), ..., (xn,yn,zn).
You can perform an operation O on any subset of these points.
Operation O on chosen points (xi1,yi1,zi1), ..., (xik,yik,zik) results in (x,y,z) = (max(xi1,...,xik), max(yi1,...,yik), max(zi1,...,zik)).
Given (x,y,z), you need to determine whether it is possible to perform operation O on some of the points in the list to get (x,y,z) as a result.
For example: you are given the points (1,2,1), (3,1,4), (5,2,1).
Can you perform operation O to get 1) (3,2,1) 2) (1,1,1)?
The first line contains n and q, i.e. the number of points and the number of queries.
The next n lines contain the n points, space separated.
The next q lines contain the q points which are the queries.
1 <= q <= 10^5
1 <= n <= 10^5
x, y, z are integers
Input:
2 2
1 3 5
5 3 1
5 3 5
3 3 3
Expected Output:
YES
NO
My logic:
for (int i = 0; i < q; i++)
{
cin>>x>>y>>z;
for (int j = 0; j < n; j++)
{
if(arr[j][0]==x && arr[j][1]<=y && arr[j][2]<=z)
first=1;
if(arr[j][0]<=x && arr[j][1]==y && arr[j][2]<=z)
second=1;
if(arr[j][0]<=x && arr[j][1]<=y && arr[j][2]==z)
third=1;
if(first+second+third==3)
break;
}
if(first+second+third==3)
cout<<"YES\n";
else
{
cout<<"NO\n";
}
first=0;
second=0;
third=0;
}
Note: here arr[][] contains the given coordinates.
For every query (x, y, z) I perform this check.
A few test cases are failing with Time Limit Exceeded. Is there a better way to do this?
Your solution is correct but slow: it has O(q * n) complexity, which is up to 10^10 operations, far too many.
I've solved your task using sorting plus a merging pass, which has O(n log n + q log q) complexity.
The algorithm is as follows:
For each of the three orderings (x, y, z), (y, x, z), (z, x, y), written as coordinates (i0, i1, i2), we do the following:
Sort all points and all queries by the tuple (i0, i1, i2).
Within each run of equal i0, compute the cumulative minimum of i2.
Merge the sorted points and queries: for each query, advance through the points with the same i0 while their i1 does not exceed the query's i1, and look at the cumulative minimum of i2 at that position. If that minimum is greater than the query's i2, the answer for this query is NO; otherwise it is probably YES (probably meaning that it has to hold for all three orderings (x, y, z), (y, x, z), (z, x, y)).
Basically, the algorithm above does the same thing yours does: it finds a point with an equal x, y, or z whose other two coordinates are not greater. But it does so through the fast algorithm for merging two sorted arrays, so the merging itself takes just O(q + n) time, and the whole running time is dominated by the sorting, which takes O(q log q + n log n).
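To make the merge step concrete with the sample input: for the ordering (i0, i1, i2) = (x, y, z), the points sorted by (x, y, z) are (1,3,5), (5,3,1), and the cumulative minimum of z within each run of equal x is 5 for x = 1 and 1 for x = 5. The query (5,3,5) walks to the points with x = 5 and y <= 3; the stored minimum z there is 1 <= 5, so this ordering passes (the other two orderings pass as well, hence YES). The query (3,3,3) finds no point with x = 3, so it fails immediately and the answer is NO.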
Try it online!
#include <iostream>
#include <vector>
#include <algorithm>
#include <tuple>
using namespace std;
typedef int CoordT;
typedef tuple<CoordT, CoordT, CoordT> CoordsT;
typedef vector<CoordsT> CoordsVecT;
template <size_t i0, size_t i1, size_t i2>
static void Solve(CoordsVecT const & ps, CoordsVecT const & qs, vector<bool> & yes) {
auto Prep = [&](auto & s, auto const & o){
s.clear();
s.reserve(o.size());
for (size_t i = 0; i < o.size(); ++i)
s.push_back(make_tuple(get<0>(o[i]), get<1>(o[i]), get<2>(o[i]), i));
sort(s.begin(), s.end(), [](auto const & l, auto const & r) -> bool {
return get<i0>(l) < get<i0>(r) || get<i0>(l) == get<i0>(r) && (
get<i1>(l) < get<i1>(r) || get<i1>(l) == get<i1>(r) &&
get<i2>(l) < get<i2>(r)
);
});
};
vector< tuple<CoordT, CoordT, CoordT, size_t> > sps, sqs;
Prep(sps, ps);
Prep(sqs, qs);
vector<CoordT> mins2(sps.size());
CoordT cmin2 = 0;
for (size_t i = 0; i < sps.size(); ++i) {
if (i == 0 || get<i0>(sps[i - 1]) != get<i0>(sps[i]))
cmin2 = get<i2>(sps[i]);
cmin2 = std::min(cmin2, get<i2>(sps[i]));
mins2[i] = cmin2;
}
for (size_t iq = 0, ip = 0; iq < sqs.size(); ++iq) {
auto & cyes = yes[get<3>(sqs[iq])];
if (!cyes)
continue;
while (ip < sps.size() && get<0>(sps[ip]) < get<0>(sqs[iq]))
++ip;
if (ip >= sps.size() || get<0>(sps[ip]) != get<0>(sqs[iq])) {
cyes = false;
continue;
}
while (ip + 1 < sps.size() && get<0>(sps[ip + 1]) == get<0>(sqs[iq]) && get<1>(sps[ip + 1]) <= get<1>(sqs[iq]))
++ip;
if (ip >= sps.size() || get<1>(sps[ip]) > get<1>(sqs[iq]) || mins2[ip] > get<2>(sqs[iq])) {
cyes = false;
continue;
}
}
}
int main() {
size_t n = 0, q = 0;
cin >> n >> q;
auto Input = [](CoordsVecT & v, size_t cnt) {
v.reserve(v.size() + cnt);
for (size_t i = 0; i < cnt; ++i) {
CoordT x, y, z;
cin >> x >> y >> z;
v.push_back(make_tuple(x, y, z));
}
};
CoordsVecT ps, qs;
Input(ps, n);
Input(qs, q);
vector<bool> yes(qs.size(), true);
Solve<0, 1, 2>(ps, qs, yes);
Solve<1, 0, 2>(ps, qs, yes);
Solve<2, 0, 1>(ps, qs, yes);
for (size_t i = 0; i < qs.size(); ++i)
cout << (yes[i] ? "YES" : "NO") << endl;
return 0;
}
Input:
2 2
1 3 5
5 3 1
5 3 5
3 3 3
Output:
YES
NO
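A side note on the time limit: the program above prints each answer with std::endl, which flushes the stream on every line. For input/output of this size a common (if modest) speed-up is to unsynchronize the streams and print '\n' instead; a minimal sketch of that setup, independent of the algorithm:
#include <iostream>

int main() {
    // Decouple C++ streams from C stdio and from each other, and avoid a
    // flush per line by using '\n' instead of std::endl.
    std::ios_base::sync_with_stdio(false);
    std::cin.tie(nullptr);

    int x;
    while (std::cin >> x)        // placeholder read loop
        std::cout << x << '\n';  // no per-line flush
    return 0;
}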

How do I check how many parts a matrix is divided into

I have a matrix made out of zeroes and ones. I need a way to see how many "zero blocks" there are. Here's a picture to better illustrate:
In this example there are 4 "zero blocks" divided by the black blocks (ones in the matrix).
As mentioned above, you can use DFS to find connected components in a graph. Here is a classic code example working on a grid where 'X' means a wall and '0' means free space (the black and white squares in your case):
#include <iostream>
#include <vector>
#include <string>
using Map = std::vector<std::string>;
using BoolMap = std::vector<std::vector<bool>>;
void dfs(BoolMap& visited, int x, int y)
{
if (x < 0 || y < 0 || y >= visited.size() || x >= visited[y].size())
return;
if (visited[y][x])
return;
visited[y][x] = true;
dfs(visited, x - 1, y);
dfs(visited, x + 1, y);
dfs(visited, x, y - 1);
dfs(visited, x, y + 1);
}
int main()
{
Map map;
map.emplace_back("0X00");
map.emplace_back("XXX0");
map.emplace_back("0X0X");
map.emplace_back("0X00");
BoolMap visited(map.size());
for (size_t y = 0; y < map.size(); y++)
{
visited[y].resize(map[y].size());
for (size_t x = 0; x < map[y].size(); x++)
{
// set visited to true if there is a wall
visited[y][x] = (map[y][x] == 'X');
}
}
size_t component_count = 0;
for (size_t y = 0; y < map.size(); y++)
{
for (size_t x = 0; x < map[y].size(); x++)
{
if (!visited[y][x])
{
dfs(visited, x, y);
component_count++;
}
}
}
std::cout << component_count << std::endl;
}
This code can be simpler if you know that your map is always square (map.size() can then be used instead of map[y].size()). Also, I loop through the map twice to check for walls, but unless the map is very big this should not be a performance issue.
If you are already working with a boolean matrix and it is okay to modify it, you can pass it directly as the visited parameter and the algorithm will work the same way.
I recommend taking a look at BFS and DFS for graph traversal. You can represent your matrix as a graph where every cell is connected to its neighbors in the 4 directions: North, South, East and West.
If you have a problem, let me know in the comments for more details.
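If the grid can be large, the recursive DFS above may also run into stack-depth limits; an iterative BFS with an explicit queue does the same flood fill without recursion. A minimal sketch (the helper name bfs is just illustrative) that could be called in place of dfs in the code above:
#include <queue>
#include <utility>
#include <vector>

using BoolMap = std::vector<std::vector<bool>>;

// Flood-fills one component starting at (x, y) using an explicit queue
// instead of recursion; marks every reachable free cell as visited.
void bfs(BoolMap& visited, int x, int y)
{
    std::queue<std::pair<int, int>> q;
    visited[y][x] = true;
    q.push(std::make_pair(x, y));
    while (!q.empty()) {
        int cx = q.front().first;
        int cy = q.front().second;
        q.pop();
        const int dx[4] = {-1, 1, 0, 0};
        const int dy[4] = {0, 0, -1, 1};
        for (int d = 0; d < 4; ++d) {
            int nx = cx + dx[d], ny = cy + dy[d];
            if (ny < 0 || ny >= (int)visited.size()) continue;
            if (nx < 0 || nx >= (int)visited[ny].size()) continue;
            if (visited[ny][nx]) continue;   // wall or already seen
            visited[ny][nx] = true;
            q.push(std::make_pair(nx, ny));
        }
    }
}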

Using std::bitset to count subsets of a finite set

I would like to check how many subsets S of the set [1, ..., 15]
there are so that it is impossible to choose two elements from S
such that their sum is a multiple of 3.
The algorithm to check this is as follows: there is a natural bijection between
the subsets of [1, ..., 15] and the strings of length 15 with two characters (assume
the two characters are '0' and '1' to fix a convention), where the character '0' in position i means that the integer i is not in the subset, while the character '1' in position i means that the integer i is in the subset.
For example, the string "111001000000000" is associated to the subset {1, 2, 3, 6}. This subset does not fulfill the constraint described above.
I wrote C++ code to generate all such strings, convert each one to a vector of ints between 1 and 15, and check, over all pairs in that vector, whether any pair's sum is a multiple of 3.
This is the code:
#include <algorithm>
#include <bitset>
#include <cmath>
#include <iostream>
#include <vector>
bool check(const std::vector<int>& dset) {
if (dset.size() == 1) {
if (dset[0] % 3 == 0) { return false; }
}
for (size_t i = 0; i < dset.size() - 1; ++i) {
auto a = dset[i];
for (size_t j = i + 1; j < dset.size(); ++j) {
auto b = dset[j];
if ((a + b) % 3 == 0) { return false; }
}
}
return true;
}
int main() {
const int N = 15; // We consider subsets of [1, ..., N].
int approved = 1; // We automatically approve the empty set.
std::bitset<N> set;
for (int n = 1; n < std::pow(2, N); ++n) {
set = std::bitset<N>(n);
std::vector<int> dset(set.count());
size_t j = 0;
for (int i = 1; i <= N; ++i) {
if (set[i - 1]) {
dset[j++] = i;
}
}
// Sweep through all couples in dset.
if (check(dset)) {
++approved;
}
}
std::cout << approved << " out of " << std::pow(2, N) << std::endl;
}
The problem is that my code returns 373, which is the wrong answer (the correct one should be 378).
I guess I am doing something wrong here, but I cannot find the error in my code.
In the check() function you don't need the special case for dset.size() == 1: a one-element subset can never contain two elements whose sum is a multiple of 3, so it should always be accepted (and for a single element the pair loop simply never executes). The special case wrongly rejects the five singletons {3}, {6}, {9}, {12} and {15}, which is exactly the missing 378 - 373 = 5. Remove the if statement below and you will get the correct count:
if (dset.size() == 1) {
if (dset[0] % 3 == 0) { return false; }
}
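As a quick cross-check of the expected total, the constraint can also be counted directly: a valid subset may contain at most one multiple of 3, and it cannot mix an element that is 1 (mod 3) with an element that is 2 (mod 3). A small sketch of that count, independent of the bitset enumeration above:
#include <iostream>

int main() {
    // Classify 1..15 by residue mod 3.
    int cnt[3] = {0, 0, 0};
    for (int i = 1; i <= 15; ++i)
        ++cnt[i % 3];                      // cnt[0] = cnt[1] = cnt[2] = 5

    // Pick any subset of the residue-1 class OR of the residue-2 class
    // (mixing them gives a sum divisible by 3; the empty set is counted once),
    // and independently at most one multiple of 3.
    long long onesOrTwos = (1LL << cnt[1]) + (1LL << cnt[2]) - 1;  // 63
    long long zeros = cnt[0] + 1;                                  // 6
    std::cout << onesOrTwos * zeros << std::endl;                  // prints 378
}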

Permutations &/ Combinations using c++

I need a different version of permutations for my code. I was able to achieve what I want, but it is not generic enough: my algorithm keeps growing along with my requirements, and it should not have to.
This is not homework; I need it for one of my critical projects, and I'm wondering whether any pre-defined algorithms are available in boost or elsewhere.
Below is the standard version of next_permutation using c++.
// next_permutation example
#include <iostream> // std::cout
#include <algorithm> // std::next_permutation
int main ()
{
int myints[] = {1,2,3};
do
{
std::cout << myints[0] << ' ' << myints[1] << ' ' << myints[2] << '\n';
} while ( std::next_permutation(myints,myints+3) );
return 0;
}
That gives the output below:
1 2 3
1 3 2
2 1 3
2 3 1
3 1 2
3 2 1
But my requirement is: let's say I have the numbers 1 to 9:
1,2,3,4,5,6,7,8,9
And I need permutations of a variable length, in ASCENDING order only and WITHOUT duplicates.
Let's say I need permutations of length 3; then I need output as below.
123
124
125
.
.
.
128
129
134 // After 129 next one should be exactly 134
135 // ascending order mandatory
136
.
.
.
148
149
156 // exactly 156 after 149, ascending order mandatory
.
.
.
489 // exactly 567 after 489: the 2nd digit would next become 9 (giving 49?),
567 // but then nothing is left for the 3rd digit, so the first digit
.   // is incremented to 5, then 6, then 7,
.   // in ascending order.
.
.
.
789 // and this should be the last set I need.
My list may contain up to a couple of hundred numbers, and the variable length can be anywhere from 1 up to the size of the list.
My own algorithm works for a specific variable length and a specific list size; when either of them changes, I need to write a huge amount of code, so I'm looking for a generic solution.
I am not even sure whether this is called permutations or whether there is a different name for this kind of math/logic.
Thanks in advance.
Formally, you want to generate all m-combinations of the set [0;n-1].
#include <iostream>
#include <vector>
bool first_combination (std::vector<int> &v, int m, int n)
{
if ((m < 0) || (m > n)) {
return false;
}
v.clear ();
v.resize (m);
for (int i = 0; i < m; i++) {
v[i] = i;
}
return true;
}
bool next_combination (std::vector<int> &v, int m, int n)
{
for (int i = m - 1; i >= 0; i--) {
if (v[i] + m - i < n) {
v[i]++;
for (int j = i + 1; j < m; j++) {
v[j] = v[j - 1] + 1;
}
return true;
}
}
return false;
}
void print_combination (const std::vector<int> &v)
{
for (size_t i = 0; i < v.size(); i++) {
std::cout << v[i] << ' ';
}
std::cout << '\n';
}
int main ()
{
const int m = 3;
const int n = 5;
std::vector<int> v;
if (first_combination (v, m, n)) {
do {
print_combination (v);
} while (next_combination (v, m, n));
}
}
You can use this code and the linked article as inspiration.
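For the original 1-to-9 digits you would call these helpers with n = 9 and print v[i] + 1, so the internal indices 0..8 map back to the digits 1..9. A minimal usage sketch, reusing first_combination and next_combination from above:
#include <iostream>
#include <vector>

// first_combination() and next_combination() as defined above.

int main ()
{
    const int m = 3;   // combination length
    const int n = 9;   // digits 1..9, stored internally as indices 0..8
    std::vector<int> v;
    if (first_combination (v, m, n)) {
        do {
            for (size_t i = 0; i < v.size(); i++)
                std::cout << v[i] + 1;   // shift 0..8 back to 1..9
            std::cout << '\n';           // prints 123, 124, ..., 789
        } while (next_combination (v, m, n));
    }
}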
This task can be done with a simple iterative algorithm: increment the first element that can be incremented and reset the elements before it to the smallest valid values, until no element can be incremented any more.
int a[] = {0,1,2,3,4,5,6,7,8,9}; // elements: must be ascending in this case
int n = sizeof(a)/sizeof(int);
int digits = 7; // number of elements you want to choose
vector<int> indexes; // creating the first combination
for ( int i=digits-1;i>=0;--i ){
indexes.push_back(i);
}
while (1){
/// printing the current combination
for ( int i=indexes.size()-1;i>=0;--i ){
cout << a[indexes[i]] ;
} cout << endl;
///
int i = 0;
while ( i < indexes.size() && indexes[i] == n-1-i ) // finding the first element
++i; // that can be incremented
if ( i==indexes.size() ) // if no element can be incremented, we are done
break;
indexes[i]++; // increment the first element
for ( int j=0;j<i;++j ){ // rescale elements before it to first combination
indexes[j] = indexes[i]+(i-j);
}
}
Output:
0123456
0123457
0123458
0123459
0123467
0123468
0123469
0123478
0123479
0123489
0123567
0123568
0123569
0123578
0123579
0123589
0123678
0123679
0123689
0123789
0124567
0124568
0124569
0124578
0124579
0124589
0124678
0124679
0124689
0124789
0125678
0125679
0125689
0125789
0126789
0134567
0134568
0134569
0134578
0134579
0134589
0134678
0134679
0134689
0134789
0135678
0135679
0135689
0135789
0136789
0145678
0145679
0145689
0145789
0146789
0156789
0234567
0234568
0234569
0234578
0234579
0234589
0234678
0234679
0234689
0234789
0235678
0235679
0235689
0235789
0236789
0245678
0245679
0245689
0245789
0246789
0256789
0345678
0345679
0345689
0345789
0346789
0356789
0456789
1234567
1234568
1234569
1234578
1234579
1234589
1234678
1234679
1234689
1234789
1235678
1235679
1235689
1235789
1236789
1245678
1245679
1245689
1245789
1246789
1256789
1345678
1345679
1345689
1345789
1346789
1356789
1456789
2345678
2345679
2345689
2345789
2346789
2356789
2456789
3456789
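Another common way to get the same combinations in the same ascending order is to run std::prev_permutation over a selection mask of m ones followed by n - m zeros; each mask marks which elements are chosen. A short sketch of that approach, assuming the same ascending input array as above:
#include <algorithm>
#include <iostream>
#include <vector>

int main()
{
    const int a[] = {0,1,2,3,4,5,6,7,8,9};   // elements, ascending
    const int n = sizeof(a) / sizeof(int);
    const int m = 7;                          // how many to choose

    std::vector<char> mask(n, 0);
    std::fill(mask.begin(), mask.begin() + m, 1);   // 1111111000

    // prev_permutation walks the masks in decreasing order, which visits
    // the chosen index sets in increasing lexicographic order:
    // 0123456, 0123457, ..., 3456789
    do {
        for (int i = 0; i < n; ++i)
            if (mask[i])
                std::cout << a[i];
        std::cout << '\n';
    } while (std::prev_permutation(mask.begin(), mask.end()));
}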

Rcpp function to find the median, given a vector of values and their frequencies

I'm writing a function to find the median of a set of values. The data is presented as a vector of the unique values (call them 'values') and a vector of their frequencies ('freqs'). The frequencies are often very high, so expanding the values out by their frequencies uses an exorbitant amount of memory. I have a slow R implementation, and it is the major bottleneck in my code, so I am writing a custom Rcpp function for use in an R/Bioconductor package. Bioconductor's site suggests not using C++11, so that is an issue for me.
My problem lies in trying to sort the two vectors together, according to the order of the values. In R we can just use the order() function, but I cannot seem to get this to work in C++, despite following the advice in this question: C++ sorting and keeping track of indexes
The following lines are where the problem lies:
// sort vector based on order of values
IntegerVector idx_ord = std::sort(idx.begin(), idx.end(),
bool (int i1, int i2) {return values[i1] < values[i2];});
And here is the full function, for anyone's interest. Any further tips would be greatly appreciated:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
double median_freq(NumericVector values, IntegerVector freqs) {
int len = freqs.size();
if (any(freqs!=0)){
int med = 0;
return med;
}
// filter out the zeros pre-sorting
IntegerVector non_zeros;
for (int i = 0; i < len; i++){
if(freqs[i] != 0){
non_zeros.push_back(i);
}
}
freqs = freqs[non_zeros];
values = values[non_zeros];
// find the order of values
// create integer vector of indices
IntegerVector idx(len);
for (int i = 0; i < len; ++i) idx[i] = i;
// sort vector based on order of values
IntegerVector idx_ord = std::sort(idx.begin(), idx.end(),
bool (int i1, int i2) {return values[i1] < values[i2];});
//apply to freqs and values
freqs = freqs[idx_ord];
values=values[idx_ord];
IntegerVector cum_freqs(len);
cum_freqs[0] = freqs[0];
for (int i = 1; i < len; ++i) cum_freqs[i] = freqs[i] + cum_freqs[i-1];
int total_freqs = cum_freqs[len-1];
// split into odd and even frequencies and calculate the median
if (total_freqs % 2 == 1) {
int med_ind = (total_freqs + 1)/2 - 1; // C++ indexes from 0
int i = 0;
while ((i < len) && cum_freqs[i] < med_ind){
i++;
}
double ret = values[i];
return ret;
} else {
int med_ind_1 = total_freqs/2 - 1; // C++ indexes from 0
int med_ind_2 = med_ind_1 + 1; // C++ indexes from 0
int i = 0;
while ((i < len) && cum_freqs[i] < med_ind_1){
i++;
}
double ret_1 = values[i];
i = 0;
while ((i < len) && cum_freqs[i] < med_ind_2){
i++;
}
double ret_2 = values[i];
double ret = (ret_1 + ret_2)/2;
return ret;
}
}
For anyone using the RUnit testing framework, here are some basic unit tests:
test_median_freq <- function(){
checkEquals(median_freq(1:10,1:10),7)
checkEquals(median_freq(1:10,rep(1,10)),5.5)
checkEquals(median_freq(2:6,c(1,2,1,45,2)),5)
}
Thanks!
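The two lines singled out above fail for two separate reasons: std::sort needs a callable comparator (the bare bool (int, int) { ... } is not valid syntax), and std::sort returns void, so its result cannot be assigned to an IntegerVector. Since Bioconductor discourages C++11, a lambda is out anyway; one pre-C++11 option is a small comparator functor that refers to the values and sorts an index vector in place. A minimal sketch with illustrative names, separate from the median function itself:
#include <Rcpp.h>
#include <algorithm>
using namespace Rcpp;

// Orders indices by the values they point to; works without C++11 lambdas.
struct IdxLess {
    const NumericVector& vals;
    IdxLess(const NumericVector& v) : vals(v) {}
    bool operator()(int i1, int i2) const { return vals[i1] < vals[i2]; }
};

// [[Rcpp::export]]
IntegerVector order_by_values(NumericVector values) {
    IntegerVector idx(values.size());
    for (int i = 0; i < values.size(); ++i) idx[i] = i;
    std::sort(idx.begin(), idx.end(), IdxLess(values));  // sorts idx in place
    return idx;
}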
I would actually combine the value and frequency into a std::pair<double, int> and then just sort them with std::sort; in this way you always keep a value and its frequency together. This enables you to write much cleaner code because there isn't an additional set of indices floating around:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
double median_freq(NumericVector values, IntegerVector freqs) {
const int len = freqs.size();
std::vector<std::pair<double, int> > allDat;
int freqSum = 0;
for (int i=0; i < len; ++i) {
allDat.push_back(std::pair<double, int>(values[i], freqs[i]));
freqSum += freqs[i];
}
std::sort(allDat.begin(), allDat.end());
int accum = 0;
for (int i=0; i < len; ++i) {
accum += allDat[i].second;
if (freqSum % 2 == 0) {
if (accum > freqSum / 2) {
return allDat[i].first;
} else if (accum == freqSum / 2) {
return (allDat[i].first + allDat[i+1].first) / 2;
}
} else {
if (accum >= (freqSum+1)/2) {
return allDat[i].first;
}
}
}
return NA_REAL; // Should not be reached
}
Try it out in R:
median_freq(1:10, 1:10)
# [1] 7
median_freq(1:10,rep(1,10))
# [1] 5.5
median_freq(2:6,c(1,2,1,45,2))
# [1] 5
We can also code up a naive R implementation to determine the efficiency gains that we get from using Rcpp:
med.freq.r <- function(values, freqs) {
ord <- order(values)
values <- values[ord]
freqs <- freqs[ord]
s <- sum(freqs)
cs <- cumsum(freqs)
idx <- min(which(cs >= s/2))
if (s %% 2 == 0 && cs[idx] == s/2) {
(values[idx] + values[idx+1]) / 2
} else {
values[idx]
}
}
med.freq.r(1:10, 1:10)
# [1] 7
med.freq.r(1:10,rep(1,10))
# [1] 5.5
med.freq.r(2:6,c(1,2,1,45,2))
# [1] 5
To benchmark, let's look at a very large set of values:
set.seed(144)
values <- rnorm(1000000)
freqs <- sample(1:100, 1000000, replace=TRUE)
all.equal(median_freq(values, freqs), med.freq.r(values, freqs))
# [1] TRUE
library(microbenchmark)
microbenchmark(median_freq(values, freqs), med.freq.r(values, freqs))
# Unit: milliseconds
# expr min lq mean median uq max neval
# median_freq(values, freqs) 128.5322 131.6095 146.8360 145.6389 159.6117 165.0306 10
# med.freq.r(values, freqs) 715.2187 744.5709 776.0539 765.9178 817.7157 855.1898 10
For 1 million entries, the Rcpp solution is about 5x faster than the R solution; given the compilation overhead, that speedup is only attractive if you're working on really huge vectors or if this is a frequently repeated operation.
Linear-time approach
In general we know how to compute the median without sorting (for details, check out http://www.cc.gatech.edu/~mihail/medianCMU.pdf). While the algorithm is a bit more delicate than just sorting and iterating, it can yield significant speedups:
// [[Rcpp::export]]
double fast_median_freq(NumericVector values, IntegerVector freqs) {
const int len = freqs.size();
std::vector<std::pair<double, int> > allDat;
int freqSum = 0;
for (int i=0; i < len; ++i) {
allDat.push_back(std::pair<double, int>(values[i], freqs[i]));
freqSum += freqs[i];
}
int target = freqSum / 2;
int low = 0;
int high = len-1;
while (true) {
// Random pivot; move to the end
int rnd = low + (rand() % (high-low+1));
std::swap(allDat[rnd], allDat[high]);
// In-place pivot
int highPos = low; // Start of values higher than pivot
int lowSum = 0; // Sum of frequencies of elements below pivot
for (int pos=low; pos < high; ++pos) {
if (allDat[pos].first <= allDat[high].first) {
lowSum += allDat[pos].second;
std::swap(allDat[highPos], allDat[pos]);
++highPos;
}
}
std::swap(allDat[highPos], allDat[high]); // Move pivot to "highPos"
// If we found the element then return; o/w recurse on proper side
if (lowSum >= target) {
// Recurse on lower elements
high = highPos - 1;
} else if (lowSum + allDat[highPos].second >= target) {
// Return
if (target < lowSum + allDat[highPos].second || freqSum % 2 == 1) {
return allDat[highPos].first;
} else {
double nextHighest = std::min_element(allDat.begin() + highPos + 1, allDat.end())->first;
return (allDat[highPos].first + nextHighest) / 2;
}
} else {
// Recurse on higher elements
low = highPos + 1;
target -= (lowSum + allDat[highPos].second);
}
}
}
Benchmarking:
all.equal(median_freq(values, freqs), fast_median_freq(values, freqs))
[1] TRUE
microbenchmark(median_freq(values, freqs), med.freq.r(values, freqs), fast_median_freq(values, freqs), times=10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# median_freq(values, freqs) 119.57989 122.48622 130.47841 130.48811 132.75421 146.36136 10
# med.freq.r(values, freqs) 665.72803 690.15016 708.05729 702.65885 731.83936 749.36834 10
# fast_median_freq(values, freqs) 24.37572 29.39641 31.86144 31.77459 34.88418 36.81606 10
The linear approach offers a 4x speedup over the sort-then-iterate Rcpp solution and a 20x speedup over the base R solution.