Special values don't work as keys in unordered_map - c++

For special values like NA or NaN, boost::unordered_map creates a new key each time I use insert.
// [[Rcpp::depends(BH)]]
#include <boost/unordered_map.hpp>
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
void test_unordered_map(NumericVector vec) {
boost::unordered_map<double, int> mymap;
int n = vec.size();
for (int i = 0; i < n; i++) {
mymap.insert(std::make_pair(vec[i], i));
}
boost::unordered_map<double, int>::iterator it = mymap.begin(), end = mymap.end();
while (it != end) {
Rcout << it->first << "\t";
it++;
}
Rcout << std::endl;
}
/*** R
x <- c(sample(10, 100, TRUE), rep(NA, 5), NaN) + 0
test_unordered_map(x)
*/
Result:
> x <- c(sample(10, 100, TRUE), rep(NA, 5), NaN)
> test_unordered_map(x)
nan nan nan nan nan nan 4 10 9 5 7 6 2 3 1 8
How do I create only one key for NA and one for NaN?

bartop's idea of using a custom comparator is good, although the particular form did not work for me. So I used Boost's documentation as a starting point. Combined with suitable functions from R, I get:
// [[Rcpp::depends(BH)]]
#include <boost/unordered_map.hpp>
#include <Rcpp.h>
using namespace Rcpp;
// No std::binary_function base: it was removed in C++17 and the map does not need it.
struct R_equal_to {
bool operator()(double x, double y) const {
return (R_IsNA(x) && R_IsNA(y)) ||
(R_IsNaN(x) && R_IsNaN(y)) ||
(x == y);
}
};
// [[Rcpp::export]]
void test_unordered_map(NumericVector vec) {
boost::unordered_map<double, int, boost::hash<double>, R_equal_to> mymap;
int n = vec.size();
for (int i = 0; i < n; i++) {
mymap.insert(std::make_pair(vec[i], i));
}
boost::unordered_map<double, int, boost::hash<double>, R_equal_to>::iterator it = mymap.begin(), end = mymap.end();
while (it != end) {
Rcout << it->first << "\t";
it++;
}
Rcout << std::endl;
}
/*** R
x <- c(sample(10, 100, TRUE), rep(NA, 5), NaN) + 0
test_unordered_map(x)
*/
Result:
> x <- c(sample(10, 100, TRUE), rep(NA, 5), NaN) + 0
> test_unordered_map(x)
7 2 nan nan 4 6 9 5 10 8 1 3
As desired, NA and NaN are inserted only once. However, one cannot differentiate between them in this output, since R's NA is just a special form of an IEEE NaN.
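To tell the two apart when printing, one option is to look at the NaN payload. The sketch below is plain C++ and rests on an assumption about R's internals: that NA_real_ is a NaN whose low 32 bits hold the value 1954, which is what R_IsNA checks for. It also assumes a 64-bit IEEE 754 double. The names make_r_na, is_r_na and label are made up for this illustration; inside Rcpp code you would simply call R_IsNA and R_IsNaN.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Assumption: R tags NA_real_ by storing 1954 in the low 32 bits of a NaN.
const std::uint32_t kRNaLowWord = 1954;

// Hypothetical helper: builds a double with the bit pattern assumed for NA_real_.
double make_r_na() {
    const std::uint64_t bits = (std::uint64_t(0x7FF00000u) << 32) | kRNaLowWord;
    double value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}

// Hypothetical helper: true only for NaNs that carry the NA payload.
bool is_r_na(double x) {
    if (!std::isnan(x)) return false;
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    return static_cast<std::uint32_t>(bits & 0xFFFFFFFFu) == kRNaLowWord;
}

// Hypothetical helper: a printable label that separates NA from other NaNs.
const char* label(double x) {
    if (is_r_na(x)) return "NA";
    if (std::isnan(x)) return "NaN";
    return "number";
}
```

Printing label(it->first) instead of it->first in the loop would then show which of the two nan keys is the NA.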

According to the IEEE 754 standard, comparing a NaN with anything using == always yields false, so you cannot do it this way. You can provide your own comparator for unordered_map using std::isnan:
#include <cmath>
auto comparator = [](double val1, double val2) {
return (std::isnan(val1) && std::isnan(val2)) || val1 == val2;
};
// The key-equality object is the third constructor argument, after the bucket count and the hasher.
boost::unordered_map<double, int, boost::hash<double>, decltype(comparator)> mymap(0, boost::hash<double>(), comparator);
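For comparison, here is a minimal sketch of the same idea using only std::unordered_map. It assumes that all NaNs should collapse into a single key, and it pairs the comparator with a hash that sends every NaN to the same value, since keys that compare equal must also hash equal. NanAwareHash, NanAwareEqual and count_distinct_keys are hypothetical names, not part of Boost or Rcpp.

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: every NaN hashes to the same value, whatever its payload.
struct NanAwareHash {
    std::size_t operator()(double x) const {
        if (std::isnan(x)) return 0;
        return std::hash<double>()(x);
    }
};

// Hypothetical helper: any two NaNs are equal, everything else uses ==.
struct NanAwareEqual {
    bool operator()(double a, double b) const {
        return (std::isnan(a) && std::isnan(b)) || a == b;
    }
};

// Counts the distinct keys in `values`, with all NaNs collapsed into one key.
std::size_t count_distinct_keys(const std::vector<double>& values) {
    std::unordered_map<double, int, NanAwareHash, NanAwareEqual> seen;
    for (std::size_t i = 0; i < values.size(); ++i) {
        // insert() keeps the first index seen for each key.
        seen.insert(std::make_pair(values[i], static_cast<int>(i)));
    }
    return seen.size();
}
```

Note that this version deliberately does not distinguish NA from NaN; that would need a payload-aware comparator like the R_IsNA-based one.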

Related

Converting R split() function to C++

Consider the reproducible example in R:
test <- c(1:12)
> test
[1] 1 2 3 4 5 6 7 8 9 10 11 12
The expected result:
test.list <- split(test, gl(2, 3))
> test.list
$`1`
[1] 1 2 3 7 8 9
$`2`
[1] 4 5 6 10 11 12
I am trying to write equivalent code in C++ to produce and return the two vectors contained in test.list. Note that I am at the embarrassing novice stage in C++.
We can use the nice answer by @jignatius and make it an R-callable function. For simplicity I keep it at NumericVector; we have a boatload of answers here that show how to switch between NumericVector and IntegerVector based on the run-time payload.
Code
#include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::List mysplit(Rcpp::NumericVector nums, int n, int size) {
std::vector<std::vector<double>> result(n);
int i = 0;
auto beg = nums.cbegin();
auto end = nums.cend();
while (beg != nums.cend()) {
//get end iterator safely
auto next = std::distance(beg, end) >= size ? beg + size : end;
//insert into result
result[i].insert(result[i].end(), beg, next);
//advance iterator
beg = next;
i = (i + 1) % n;
}
Rcpp::List ll;
for (const auto&v : result)
ll.push_back(v);
return ll;
}
/*** R
testvec <- 1:12
mysplit(testvec, 2, 3)
*/
Output
> Rcpp::sourceCpp("~/git/stackoverflow/68858728/answer.cpp")
> testvec <- 1:12
> mysplit(testvec, 2, 3)
[[1]]
[1] 1 2 3 7 8 9
[[2]]
[1] 4 5 6 10 11 12
>
There is a minor error in the original question in that we do not need a call to gl(); just the two scalars are needed.
Try this, which creates a vector of vectors containing the elements from the source in alternating chunks:
#include <iostream>
#include <vector>
template<typename T>
std::vector<std::vector<T>> split(std::vector<T> nums, int n, int size)
{
std::vector<std::vector<T>> result(n);
int i = 0;
auto beg = nums.cbegin();
auto end = nums.cend();
while (beg != nums.cend()) {
//get end iterator safely
auto next = std::distance(beg, end) >= size ? beg + size : end;
//insert into result
result[i].insert(result[i].end(), beg, next);
//advance iterator
beg = next;
i = (i + 1) % n;
}
return result;
}
int main()
{
std::vector<int> vnums = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 };
auto vectors = split(vnums, 2, 3);
for (const auto& v : vectors)
{
for (auto num : v) {
std::cout << num << " ";
}
std::cout << std::endl;
}
}

The conditional operator is not allowing the program to terminate

I just learnt about conditional operators and was doing an introductory exercise stating:
Write a program to use a conditional operator to find the elements in
a vector<int> that have odd value and double the value of each such
element.
Here is the code that I wrote:
int main()
{
vector<int> nums = { 1,2,3,4,5,6,7,8,9 };
int i;
auto beg = nums.begin();
while (*beg > 0) // This will always evaluate to true.
{
((*beg) % 2 == 0 && (beg < nums.end()) ? i = 0 : *beg = 2 * (*(beg++)));
/*If the number is even the program will just assign 0 to i*/
}
}
The program terminates AND gives you the correct output if you change the last line to:
((*beg)%2 == 0 && (beg < nums.end()) ? i = 0 : *beg = 2*(*(beg)));
++beg;
Why is this happening?
It gets stuck because, when the condition (*beg) % 2 == 0 && (beg < nums.end()) is true, the iterator is never incremented, so the loop keeps testing the same element. That branch only sets i = 0. You need to increment the iterator there as well.
You can use comma operator for this:
while (beg != nums.end() && *beg > 0)
{
(*beg) % 2 == 0 ? (beg++, i): (*beg = 2 * (*beg) , beg++, ++i );
}
Also note that the count i should be initialized beforehand, not in the while loop.
The complete working code as per the requirement would be:
#include <iostream>
#include <vector>
int main()
{
std::vector<int> nums = { 1,2,3,4,5,6,7,8,9 };
int i{0};
auto beg = nums.begin();
while (beg != nums.end() && *beg > 0)
{
(*beg) % 2 == 0 ? (beg++, i): (*beg = 2 * (*beg) , beg++, ++i );
}
for (const int ele : nums)
std::cout << ele << " ";
std::cout << "\ncount: " << i << "\n";
}
Output:
2 2 6 4 10 6 14 8 18
count: 5
That being said, IMO combining the comma operator with the conditional operator as above (as the task demands) is not good coding style; it will only confuse future readers of your codebase.
Also read: Why is "using namespace std;" considered bad practice?
If you want to double some values and not others, just do it:
#include <iostream>
#include <vector>
int main() {
std::vector<int> nums = { 1, 2, 3, 4, 5, 6, 7, 8, 9 };
for (int& num : nums)
num = num % 2 ? 2 * num : num;
for (int num : nums)
std::cout << num << ' ';
std::cout << '\n';
return 0;
}
A conditional expression is an expression; you use it to compute a value. The code in the question does not do that; it uses the conditional expression as a way of selecting side effects, which is better done with an ordinary if statement.
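As a rough sketch of that last point, here is the same task written with a plain if statement, which also counts how many elements were doubled. The function name double_odds is just for illustration.

```cpp
#include <cstddef>
#include <vector>

// Doubles every odd element in place and returns how many were doubled.
std::size_t double_odds(std::vector<int>& nums) {
    std::size_t doubled = 0;
    for (int& num : nums) {
        if (num % 2 != 0) {  // != 0 also catches negative odd values
            num *= 2;
            ++doubled;
        }
    }
    return doubled;
}
```

Called on the vector from the question, it returns 5 and leaves the vector as 2 2 6 4 10 6 14 8 18, with no iterator bookkeeping to get wrong.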

Duplicating dplyr::group_by's indices functionality with Rcpp

As an exercise, I'm trying to use Rcpp and C++ to get grouping indices, much like what dplyr::group_by provides. These are the row numbers (starting from 0) corresponding to each group in the data.
Here's an example of what the indices would look like.
x <- sample(1:3, 10, TRUE)
x
# [1] 3 3 3 1 3 1 3 2 3 2
df <- data.frame(x)
attr(dplyr::group_by(df, x), "indices")
#[[1]]
#[1] 3 5
#
#[[2]]
#[1] 7 9
#
#[[3]]
#[1] 0 1 2 4 6 8
So far, using the standard library's std::unordered_multimap, I've come up with the following:
// [[Rcpp::plugins(cpp11)]]
#include <Rcpp.h>
using namespace Rcpp;
typedef std::vector<int> rowvec;
// [[Rcpp::export]]
std::vector<rowvec> rowlist(std::vector<int> x)
{
std::unordered_multimap<int, int> rowmap;
for (size_t i = 0; i < x.size(); i++)
{
rowmap.insert({ x[i], i });
}
std::vector<rowvec> rowlst;
for (size_t i = 0; i < rowmap.bucket_count(); i++)
{
if (rowmap.begin(i) != rowmap.end(i))
{
rowvec v(rowmap.count(i));
int b = 0;
for (auto it = rowmap.begin(i); it != rowmap.end(i); ++it, b++)
{
v[b] = it->second;
}
rowlst.push_back(v);
}
}
return rowlst;
}
Running this on a single variable results in
rowlist(x)
#[[1]]
#[1] 5 3
#
#[[2]]
#[1] 9 7
#
#[[3]]
#[1] 8 6 4 2 1 0
Other than the reversed ordering, this looks good. However, I can't figure out how to extend this to handle:
Different data types; the type is currently hardcoded into the function
More than one grouping variable
(std::unordered_multimap is also pretty slow compared to what group_by does, but I'll deal with that later.) Any help would be appreciated.
I've mulled over this question for quite some time, and my conclusion is that this is going to be quite difficult, to say the least. In order to replicate the magic of dplyr::group_by, you are going to have to write several classes and set up a really slick hashing function to deal with various data types and a differing number of columns. I've scoured the dplyr source code, and it looks like if you follow the creation of ChunkMapIndex, you will get a better understanding.
Speaking of data types, I'm not even sure that std::unordered_multimap can get you what you want, as it is unwise and difficult to use double/float data types as your key.
Given all of the challenges mentioned, the code below will produce the same output as attr(dplyr::group_by(df, x), "indices") with integer types. I've set it up to get you started thinking about how to deal with different data types. It uses a templated approach with a helper function, as that is an easy and effective way of dealing with different data types. The helper function is very similar to the functions in the links Dirk provided.
// [[Rcpp::plugins(cpp11)]]
#include <Rcpp.h>
#include <string>
#include <unordered_map>
#include <vector>
using namespace Rcpp;
typedef std::vector<int> rowvec;
typedef std::vector<rowvec> rowvec2d;
template <typename T>
rowvec2d rowlist(std::vector<T> x) {
std::unordered_multimap<T, int> rowmap;
for (int i = 0; i < x.size(); i++)
rowmap.insert({ x[i], i });
rowvec2d rowlst;
for (int i = 0; i < rowmap.bucket_count(); i++) {
if (rowmap.begin(i) != rowmap.end(i)) {
rowvec v(rowmap.count(i));
int b = 0;
for (auto it = rowmap.begin(i); it != rowmap.end(i); ++it, b++)
v[b] = it->second;
rowlst.push_back(v);
}
}
return rowlst;
}
template <typename T>
rowvec2d tempList(rowvec2d myList, std::vector<T> v) {
rowvec2d vecOut;
if (myList.size() > 0) {
for (std::size_t i = 0; i < myList.size(); i++) {
std::vector<T> vecPass(myList[i].size());
for (std::size_t j = 0; j < myList[i].size(); j++)
vecPass[j] = v[myList[i][j]];
rowvec2d vecTemp = rowlist(vecPass);
for (std::size_t j = 0; j < vecTemp.size(); j++) {
rowvec myIndex(vecTemp[j].size());
for (std::size_t k = 0; k < vecTemp[j].size(); k++)
myIndex[k] = myList[i][vecTemp[j][k]];
vecOut.push_back(myIndex);
}
}
} else {
vecOut = rowlist(v);
}
return vecOut;
}
// [[Rcpp::export]]
rowvec2d rowlistMaster(DataFrame myDF) {
DataFrame::iterator itDF;
rowvec2d result;
for (itDF = myDF.begin(); itDF != myDF.end(); itDF++) {
switch(TYPEOF(*itDF)) {
case INTSXP: {
result = tempList(result, as<std::vector<int> >(*itDF));
break;
}
default: {
stop("v must be of type integer");
}
}
}
return result;
}
It works with more than one grouping variable; however, it is not nearly as fast as dplyr:
set.seed(101)
x <- sample(1:5, 10^4, TRUE)
y <- sample(1:5, 10^4, TRUE)
w <- sample(1:5, 10^4, TRUE)
z <- sample(1:5, 10^4, TRUE)
df <- data.frame(x,y,w,z)
identical(attr(dplyr::group_by(df, x, y, w, z), "indices"), rowlistMaster(df))
[1] TRUE
library(microbenchmark)
microbenchmark(dplyr = attr(dplyr::group_by(df, x, y, w, z), "indices"),
challenge = rowlistMaster(df))
Unit: milliseconds
expr min lq mean median uq max neval
dplyr 2.693624 2.900009 3.324274 3.192952 3.535927 6.827423 100
challenge 53.905133 70.091335 123.131484 141.414806 149.923166 190.010468 100
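If hashing is not a requirement, a simpler sketch is to collect the row numbers in a std::map keyed by the group value. This is not what dplyr does internally, and I make no claim about its speed; the name group_indices is hypothetical. Because std::map keeps its keys sorted and the rows are visited in order, the groups come out in ascending key order with ascending row numbers.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Returns, for each distinct key (in ascending key order), the 0-based row
// numbers at which that key occurs.
template <typename Key>
std::vector<std::vector<int>> group_indices(const std::vector<Key>& keys) {
    std::map<Key, std::vector<int>> groups;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        groups[keys[i]].push_back(static_cast<int>(i));
    }
    std::vector<std::vector<int>> result;
    result.reserve(groups.size());
    for (const auto& entry : groups) {
        result.push_back(entry.second);
    }
    return result;
}
```

Any key type with operator< works, so two grouping columns could be handled by zipping them into a std::vector of std::pair<int, int> first.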

sorting a 2d array by one row

I have a very specific problem to deal with. I need to sort an array[4][x] in descending order.
For instance, if I get values like:
{121,120,203,240}
{0.5,0.2,3.2,1.4}
{1.3,1.5,1.2,1.8}
{3 ,2 ,5 ,4 }
All values have to be sorted by the 4th row. Thus, I need an output like this:
{203,240,121,120}
{3.2,1.4,0.5,0.2}
{1.2,1.8,1.3,1.5}
{5 ,4 ,3 ,2 }
I have tried doing it by the bubble sort method, but it does not work properly.
A straightforward approach to sorting the array using bubble sort can look like this:
#include <iostream>
#include <iomanip>
#include <utility>
int main()
{
const size_t N = 4;
double a[][N] =
{
{ 121, 120, 203, 240 },
{ 0.5, 0.2, 3.2, 1.4 },
{ 1.3, 1.5, 1.2, 1.8 },
{ 3, 2, 5, 4 }
};
for (const auto &row : a)
{
for (double x : row) std::cout << std::setw( 3 ) << x << ' ';
std::cout << '\n';
}
std::cout << std::endl;
// The bubble sort
for (size_t n = N, last = N; not (n < 2); n = last)
{
for (size_t i = last = 1; i < n; i++)
{
if (a[N - 1][i - 1] < a[N - 1][i])
{
for (size_t j = 0; j < N; j++)
{
std::swap(a[j][i - 1], a[j][i]);
}
last = i;
}
}
}
for (const auto &row : a)
{
for (double x : row) std::cout << std::setw( 3 ) << x << ' ';
std::cout << '\n';
}
std::cout << std::endl;
return 0;
}
The program output is
121 120 203 240
0.5 0.2 3.2 1.4
1.3 1.5 1.2 1.8
3 2 5 4
203 240 121 120
3.2 1.4 0.5 0.2
1.2 1.8 1.3 1.5
5 4 3 2
All you need is to extract the code of the bubble sort from main and rewrite it as a separate function for any 2D array and any row used as the criteria of sorting.
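A minimal sketch of such a function, assuming a built-in 2D array whose dimensions are known at compile time. The name sort_columns_desc and the key_row parameter are made up here.

```cpp
#include <cstddef>
#include <utility>

// Bubble-sorts the columns of a[Rows][Cols] so that row `key_row` ends up in
// descending order; whole columns are swapped to keep the rows aligned.
template <typename T, std::size_t Rows, std::size_t Cols>
void sort_columns_desc(T (&a)[Rows][Cols], std::size_t key_row) {
    // `last` remembers where the final swap of a pass happened; everything
    // from that position on is already in place.
    for (std::size_t n = Cols, last = Cols; n >= 2; n = last) {
        last = 1;
        for (std::size_t i = 1; i < n; ++i) {
            if (a[key_row][i - 1] < a[key_row][i]) {
                for (std::size_t j = 0; j < Rows; ++j) {
                    std::swap(a[j][i - 1], a[j][i]);
                }
                last = i;
            }
        }
    }
}
```

With the array from the program above, sort_columns_desc(a, 3) sorts by the fourth row. Unlike the loop in main, the row count and column count are separate template parameters, so the array does not have to be square.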
The problem would be easy to solve if instead of parallel vectors we had a structure containing parallel values.
It is easy enough to get back to such a structure: just create some intermediate vector containing sort keys and indexes and sort it.
After sorting, the indexes give us a direct way to put all the individual vectors in the right order.
I would do something like the code below (I put it in a Boost unit test, but what it does should be obvious).
#define BOOST_AUTO_TEST_MAIN
#define BOOST_TEST_MODULE TestPenta
#include <boost/test/auto_unit_test.hpp>
#include <algorithm>
#include <iostream>
#include <utility>
#include <vector>
std::vector<int> v1 = {121,120,203,240};
std::vector<float> v2 = {0.5,0.2,3.2,1.4};
std::vector<float> v3 = {1.3,1.5,1.2,1.8};
std::vector<int> v4 = {3 ,2 ,5 ,4 };
std::vector<int> expected_v1 = {203,240,121,120};
std::vector<float> expected_v2 = {3.2,1.4,0.5,0.2};
std::vector<float> expected_v3 = {1.2,1.8,1.3,1.5};
std::vector<int> expected_v4 = {5 ,4 ,3 ,2 };
BOOST_AUTO_TEST_CASE(TestFailing)
{
// First create an index to sort containing sort key and initial position
std::vector<std::pair<int,int>> vindex{};
int i = 0;
for (auto x: v4){
vindex.push_back(std::pair<int,int>(x,i));
++i;
}
// Sort the index vector by key value
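struct CmpIndex {
// const references and a const call operator, since the comparator must not modify its arguments
bool operator() (const std::pair<int, int> & a, const std::pair<int, int> & b) const {
return a.first > b.first ;
}
} cmp;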
std::sort(vindex.begin(), vindex.end(), cmp);
// Now reorder all the parallel vectors using index
// (of course in actual code we would write some loop if several vector are of the same type).
// I'm using parallel loops to avoid using too much memory for intermediate vectors
{
std::vector<int> r1;
for (auto & p: vindex){
r1.push_back(v1[p.second]);
}
v1 = r1;
}
{
std::vector<float> r2;
for (auto & p: vindex){
r2.push_back(v2[p.second]);
}
v2 = r2;
}
{
std::vector<float> r3;
for (auto & p: vindex){
r3.push_back(v3[p.second]);
}
v3 = r3;
}
{
std::vector<int> r4;
for (auto & p: vindex){
r4.push_back(v4[p.second]);
}
v4 = r4;
}
// Et voila! The vectors are all sorted as expected
// i is advanced only by the loop header, so every position is checked
for (int i = 0 ; i < 4 ; ++i){
BOOST_CHECK_EQUAL(expected_v1[i], v1[i]);
BOOST_CHECK_EQUAL(expected_v2[i], v2[i]);
BOOST_CHECK_EQUAL(expected_v3[i], v3[i]);
BOOST_CHECK_EQUAL(expected_v4[i], v4[i]);
}
}
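The same index idea can be sketched more compactly with std::iota and a lambda. This is only an illustration: descending_order and apply_order are hypothetical helper names, and std::stable_sort is used on the assumption that columns with equal keys should keep their original relative order.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns the positions of `keys` arranged so that the keys are descending.
template <typename T>
std::vector<std::size_t> descending_order(const std::vector<T>& keys) {
    std::vector<std::size_t> order(keys.size());
    std::iota(order.begin(), order.end(), std::size_t{0});  // 0, 1, 2, ...
    std::stable_sort(order.begin(), order.end(),
                     [&keys](std::size_t a, std::size_t b) { return keys[a] > keys[b]; });
    return order;
}

// Builds a copy of `values` rearranged according to `order`.
template <typename T>
std::vector<T> apply_order(const std::vector<T>& values,
                           const std::vector<std::size_t>& order) {
    std::vector<T> result;
    result.reserve(values.size());
    for (std::size_t index : order) {
        result.push_back(values[index]);
    }
    return result;
}
```

The order is computed once from v4 and then applied to each of v1 to v4, whatever their element types, which replaces the four hand-written loops.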

Swap neighbouring elements in std::list

I want to swap neighbouring elements in a std::list.
Example of the list and values
A B C D E F G
3 2 1 2 1 3 2
What I expect to receive after sorting:
A B D C F E G
3 2 2 1 3 1 2
So, simply: A > B, so there is nothing to do, but C < D, so swap them and go on to the E comparison.
I have no idea how to swap neighbouring elements.
So, for "good" elements I want to move forward by one step.
You can easily do this by using two iterators:
void biswap(std::list<int> &l)
{
if (l.size() < 2)
return;
auto it2 = l.begin();
auto it1 = it2++;
auto e = l.end();
for (;;)
{
if (*it1 < *it2)
std::swap(*it1, *it2);
it1 = it2++;
if (it2 == e)
return;
it1 = it2++;
if (it2 == e)
return;
}
}
Note: in case you're not using C++11 and thus calling size() might present a significant overhead, you can replace it with this (and of course replace all usage of auto with explicit types):
void biswap(std::list<int> &l)
{
auto it2 = l.begin();
auto e = l.end();
if (it2 == e)
return;
auto it1 = it2++;
if (it2 == e)
return;
for (;;)
// ... the rest as before
}
You can use the standard algorithm std::adjacent_find to find a pair of elements for which first < second and then swap them.
For example
#include <iostream>
#include <algorithm>
#include <list>
#include <functional>
#include <iterator>
int main()
{
std::list<int> l = { 3, 2, 1, 2, 1, 3, 2 };
for ( int x : l ) std::cout << x << ' ';
std::cout << std::endl;
auto it = std::adjacent_find( l.begin(), l.end(), std::less<int>() );
if ( it != l.end() ) std::swap( *it, *std::next( it ) );
for ( int x : l ) std::cout << x << ' ';
std::cout << std::endl;
}
The output is
3 2 1 2 1 3 2
3 2 2 1 1 3 2
Or if you want to process all such situations when first < second then you can use the following code
#include <iostream>
#include <algorithm>
#include <list>
#include <functional>
#include <iterator>
int main()
{
std::list<int> l = { 3, 2, 1, 2, 1, 3, 2 };
for ( int x : l ) std::cout << x << ' ';
std::cout << std::endl;
std::list<int>::iterator it = l.begin();
while ( ( it = std::adjacent_find( it, l.end(), std::less<int>() ) ) != l.end() )
{
std::swap( *it, *std::next( it ) );
}
for ( int x : l ) std::cout << x << ' ';
std::cout << std::endl;
}
The output is
3 2 1 2 1 3 2
3 2 2 1 3 2 1
This gives the desired result and it switches the comparison for "good" elements, exactly as you asked:
#include <list>
#include <iostream>
using namespace std;
template<typename Type>
void print(const list<Type> & l)
{
for (auto element : l) cout << element << " ";
cout << endl;
}
int main(int argc, const char *argv[])
{
list<int> l = {3,2,1,2,1,3,2};
print(l);
auto it = l.begin();
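// Check it itself first: after a swap it advances twice and may already be at end()
while (it != l.end() && std::next(it) != l.end())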
{
if (*it < *std::next(it))
{
std::swap(*it, *std::next(it));
++it;
}
++it;
}
print(l);
return 0;
}
Running it prints:
3 2 1 2 1 3 2
3 2 2 1 3 1 2
In case you don't have C++11 support, a simple C++03 solution using two iterators could be:
#include <iostream>
#include <algorithm>
#include <list>
void swapNeighbours(std::list<int>& l) {
if (l.empty())
return; // nothing to swap; also avoids incrementing end() below
std::list<int>::iterator it = l.begin();
std::list<int>::iterator prev = it++;
while (prev != l.end() && it != l.end()) {
// swap if needed:
if (*prev < *it)
std::swap(*prev, *it);
// move 2 times forward:
if (++it == l.end())
break;
prev = it++;
}
}
Then (based on your example) if you do:
void print(const std::list<int>& l) {
std::list<int>::const_iterator i;
for (i = l.begin(); i != l.end(); ++i) {
std::cout << *i << ' ';
}
std::cout << std::endl;
}
int main() {
std::list<int> l;
l.push_back(3); l.push_back(2); l.push_back(1); l.push_back(2);
l.push_back(1); l.push_back(3); l.push_back(2);
print(l);
swapNeighbours(l);
print(l);
}
then for the:
A B C D E F G
3 2 1 2 1 3 2
the following comparisons are made: A < B? (no) C < D? (yes, swap) E < F? (yes, swap too), yielding the output:
3 2 2 1 3 1 2
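With C++11 available, the same pairwise walk can be sketched for any forward range. The name swap_ascending_pairs is made up for this illustration; it assumes non-overlapping pairs (1st/2nd, 3rd/4th, and so on), as in swapNeighbours above.

```cpp
#include <algorithm>
#include <iterator>
#include <list>

// Walks [first, last) in non-overlapping pairs and swaps a pair when its
// first element is less than its second.
template <typename ForwardIt>
void swap_ascending_pairs(ForwardIt first, ForwardIt last) {
    while (first != last) {
        ForwardIt second = std::next(first);
        if (second == last) {
            break;  // an odd element is left over at the end
        }
        if (*first < *second) {
            std::iter_swap(first, second);
        }
        first = std::next(second);  // jump to the start of the next pair
    }
}

// Convenience overload for the list type used in the question.
void swap_ascending_pairs(std::list<int>& l) {
    swap_ascending_pairs(l.begin(), l.end());
}
```

Both end checks happen before any dereference, so empty and single-element lists are handled without a separate size test.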
If all you want to do is a pairwise traversal and swap elements if the first one is less than the second then you can e.g. use std::adjacent_find like this:
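using std::swap;
// Start from the first ascending pair, so nothing is swapped (or dereferenced) unless such a pair exists
for (auto it = std::adjacent_find(std::begin(l), std::end(l), std::less<int>{}); it != std::end(l); it = std::adjacent_find(it, std::end(l), std::less<int>{})) {
swap(*it, *std::next(it));
}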
This will cause the list of numbers 3 2 1 2 1 3 2 to be arranged as:
3 2 2 1 3 2 1
But this is not the same as your expected result 3 2 2 1 3 1 2, and I don't understand why you do not want to swap the last pair 1 2 like the others.
If what you want is an erratic pattern for sorting then there can be no simple solution. Instead you must use special handling of individual elements.