C++: Sampling from discrete distribution without replacement - c++

I'd like to sample from a discrete distribution without replacement (i.e., without repetition).
With the function discrete_distribution, it is possible to sample with replacement. And, with this function, I implemented sampling without replacement in a very rough way:
#include <iostream>
#include <random>
#include <string>
#include <vector>
#include <array>
int main()
{
const int sampleSize = 8; // Size of the sample
std::vector<double> weights = {2,2,1,1,2,2,1,1,2,2}; // 10 possible outcomes with different weights
std::random_device rd;
std::mt19937 generator(rd());
/// WITH REPLACEMENT
std::discrete_distribution<int> distribution(weights.begin(), weights.end());
std::array<int, 10> p ={};
for(int i=0; i<sampleSize; ++i){
int number = distribution(generator);
++p[number];
}
std::cout << "Discrete_distribution with replacement:" << std::endl;
for (int i=0; i<10; ++i)
std::cout << i << ": " << std::string(p[i],'*') << std::endl;
/// WITHOUT REPLACEMENT
p = {};
for(int i=0; i<sampleSize; ++i){
std::discrete_distribution<int> distribution(weights.begin(), weights.end());
int number = distribution(generator);
weights[number] = 0; // the weight associated with the sampled value is set to 0
++p[number];
}
std::cout << "Discrete_distribution without replacement:" << std::endl;
for (int i=0; i<10; ++i)
std::cout << i << ": " << std::string(p[i],'*') << std::endl;
return 0;
}
Have you ever coded such sampling without replacement? Probably in a more optimized way?
Thank you.
Cheers,
T.A.

This solution might be a bit shorter. Unfortunately, it needs to create a discrete_distribution<> object in every step, which might be prohibitive when drawing a lot of samples.
#include <iostream>
#include <vector>
#include <boost/random/discrete_distribution.hpp>
#include <boost/random/mersenne_twister.hpp>
using namespace boost::random;
int main(int, char**) {
std::vector<double> w = { 2, 2, 1, 1, 2, 2, 1, 1, 2, 2 };
discrete_distribution<> dist(w);
int n = 10;
boost::random::mt19937 gen;
std::vector<int> samples;
for (auto i = 0; i < n; i++) {
samples.push_back(dist(gen));
w[*samples.rbegin()] = 0;
dist = discrete_distribution<>(w);
}
for (auto iter : samples) {
std::cout << iter << " ";
}
return 0;
}
Improved answer:
After carefully looking for a similar question on this site (Faster weighted sampling without replacement), I found a stunningly simple algorithm for weighted sampling without replacement; it is just a bit complicated to implement in C++. Note that this is not the most efficient algorithm, but it seems to me the simplest one to implement: draw a uniform random number r in (0, 1) for each item, compute the key r^(1/w) from the item's weight w, and keep the items with the largest keys.
The method is described in detail in https://doi.org/10.1016/j.ipl.2005.11.003.
In particular, this version is not efficient if the sample size is much smaller than the population, because it generates and sorts a key for every item.
#include <algorithm>
#include <cmath>
#include <iostream>
#include <iterator>
#include <utility>
#include <vector>
#include <boost/random/uniform_01.hpp>
#include <boost/random/mersenne_twister.hpp>
using namespace boost::random;
int main(int, char**) {
std::vector<double> w = { 2, 2, 1, 1, 2, 2, 1, 1, 2, 10 };
uniform_01<> dist;
boost::random::mt19937 gen;
std::vector<double> vals;
std::generate_n(std::back_inserter(vals), w.size(), [&dist,&gen]() { return dist(gen); });
std::transform(vals.begin(), vals.end(), w.begin(), vals.begin(), [&](auto r, auto w) { return std::pow(r, 1. / w); });
std::vector<std::pair<double, int>> valIndices;
size_t index = 0;
std::transform(vals.begin(), vals.end(), std::back_inserter(valIndices), [&index](auto v) { return std::pair<double,size_t>(v,index++); });
std::sort(valIndices.begin(), valIndices.end(), [](auto x, auto y) { return x.first > y.first; });
std::vector<int> samples;
std::transform(valIndices.begin(), valIndices.end(), std::back_inserter(samples), [](auto v) { return v.second; });
for (auto iter : samples) {
std::cout << iter << " ";
}
return 0;
}
Easier answer
I just removed some of the STL algorithms and replaced them with simple for loops.
#include <cmath>
#include <iostream>
#include <iterator>
#include <utility>
#include <vector>
#include <boost/random/uniform_01.hpp>
#include <boost/random/mersenne_twister.hpp>
#include <algorithm>
using namespace boost::random;
int main(int, char**) {
std::vector<double> w = { 2, 2, 1, 1, 2, 2, 1, 1, 2, 1000 };
uniform_01<> dist;
boost::random::mt19937 gen(342575235);
std::vector<double> vals;
for (auto iter : w) {
vals.push_back(std::pow(dist(gen), 1. / iter));
}
// Sorting vals, but retain the indices.
// There is unfortunately no easy way to do this with STL.
std::vector<std::pair<int, double>> valsWithIndices;
for (size_t iter = 0; iter < vals.size(); iter++) {
valsWithIndices.emplace_back(iter, vals[iter]);
}
std::sort(valsWithIndices.begin(), valsWithIndices.end(), [](auto x, auto y) {return x.second > y.second; });
std::vector<size_t> samples;
int sampleSize = 8;
for (auto iter = 0; iter < sampleSize; iter++) {
samples.push_back(valsWithIndices[iter].first);
}
for (auto iter : samples) {
std::cout << iter << " ";
}
return 0;
}
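For readers without Boost, the same key-based idea can be sketched with the standard library only. This is a minimal sketch, not the code from the answer above: the helper name weightedSampleWithoutReplacement is made up for illustration, it assumes every weight is strictly positive, and it swaps the full sort for std::partial_sort so that only the k largest keys get ordered, which should help when the sample is much smaller than the population.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Hypothetical helper (name and signature are assumptions for this sketch).
// Draws k distinct indices: each index i gets the key u^(1/w_i) with u uniform
// in [0, 1), and the k indices with the largest keys form the sample.
// Assumes all weights are strictly positive and k <= weights.size().
template <class Generator>
std::vector<std::size_t> weightedSampleWithoutReplacement(
    const std::vector<double>& weights, std::size_t k, Generator& gen)
{
    assert(k <= weights.size());
    typedef std::pair<double, std::size_t> KeyAndIndex;
    std::uniform_real_distribution<double> uniform(0.0, 1.0);

    // Pair each key with its index so the index survives the reordering.
    std::vector<KeyAndIndex> keyed;
    keyed.reserve(weights.size());
    for (std::size_t i = 0; i < weights.size(); ++i) {
        assert(weights[i] > 0.0);
        keyed.emplace_back(std::pow(uniform(gen), 1.0 / weights[i]), i);
    }

    // Only the k largest keys need to end up in order at the front.
    std::partial_sort(keyed.begin(),
                      keyed.begin() + static_cast<std::ptrdiff_t>(k),
                      keyed.end(),
                      [](const KeyAndIndex& a, const KeyAndIndex& b) {
                          return a.first > b.first;
                      });

    std::vector<std::size_t> sample;
    sample.reserve(k);
    for (std::size_t i = 0; i < k; ++i) {
        sample.push_back(keyed[i].second);
    }
    return sample;
}
```

It could be called as weightedSampleWithoutReplacement(w, 8, gen) with a std::mt19937 gen; the returned indices come out ordered by decreasing key, not by index.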

The existing answer by Aleph0 works the best of the ones I tested. I benchmarked three variants: the original solution from the question, the one added by Aleph0, and a new one that only constructs a new discrete_distribution once more than half of the items in the current distribution have already been sampled (and redraws whenever the distribution produces an item that is already in the sample).
I tested with sample size == population size and each weight equal to its index plus one. I think the original solution in the question runs in O(n^2) and my new one runs in O(n log n). The one from the paper looks close to O(n) in these measurements, although its sort step makes it O(n log n) as well.
-------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
-------------------------------------------------------------
BM_Reuse              25252721 ns     25251731 ns           26
BM_NewDistribution 17338706125 ns  17313620000 ns            1
BM_SomePaper           6789525 ns      6779400 ns          100
Code:
#include <algorithm>
#include <array>
#include <benchmark/benchmark.h>
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/uniform_01.hpp>
#include <cmath>
#include <iostream>
#include <iterator>
#include <random>
#include <utility>
#include <vector>
const int sampleSize = 20000;
using namespace boost::random;
static void BM_ReuseDistribution(benchmark::State &state) {
std::vector<double> weights;
weights.resize(sampleSize);
for (auto _ : state) {
for (int i = 0; i < sampleSize; i++) {
weights[i] = i + 1;
}
std::random_device rd;
std::mt19937 generator(rd());
int o[sampleSize];
std::discrete_distribution<int> distribution(weights.begin(),
weights.end());
int numAdded = 0;
int distSize = sampleSize;
for (int i = 0; i < sampleSize; ++i) {
if (numAdded > distSize / 2) {
distSize -= numAdded;
numAdded = 0;
distribution =
std::discrete_distribution<int>(weights.begin(), weights.end());
}
int number = distribution(generator);
if (!weights[number]) {
i -= 1;
continue;
} else {
weights[number] = 0;
o[i] = number;
numAdded += 1;
}
}
}
}
BENCHMARK(BM_ReuseDistribution);
static void BM_NewDistribution(benchmark::State &state) {
std::vector<double> weights;
weights.resize(sampleSize);
for (auto _ : state) {
for (int i = 0; i < sampleSize; i++) {
weights[i] = i + 1;
}
std::random_device rd;
std::mt19937 generator(rd());
int o[sampleSize];
for (int i = 0; i < sampleSize; ++i) {
std::discrete_distribution<int> distribution(weights.begin(),
weights.end());
int number = distribution(generator);
weights[number] = 0;
o[i] = number;
}
}
}
BENCHMARK(BM_NewDistribution);
static void BM_SomePaper(benchmark::State &state) {
std::vector<double> w;
w.resize(sampleSize);
for (auto _ : state) {
for (int i = 0; i < sampleSize; i++) {
w[i] = i + 1;
}
uniform_01<> dist;
boost::random::mt19937 gen;
std::vector<double> vals;
std::generate_n(std::back_inserter(vals), w.size(),
[&dist, &gen]() { return dist(gen); });
std::transform(vals.begin(), vals.end(), w.begin(), vals.begin(),
[&](auto r, auto w) { return std::pow(r, 1. / w); });
std::vector<std::pair<double, int>> valIndices;
size_t index = 0;
std::transform(
vals.begin(), vals.end(), std::back_inserter(valIndices),
[&index](auto v) { return std::pair<double, size_t>(v, index++); });
std::sort(valIndices.begin(), valIndices.end(),
[](auto x, auto y) { return x.first > y.first; });
std::vector<int> samples;
std::transform(valIndices.begin(), valIndices.end(),
std::back_inserter(samples),
[](auto v) { return v.second; });
}
}
BENCHMARK(BM_SomePaper);
BENCHMARK_MAIN();

Thanks for your question and for the others' nice answers; I ran into the same question. I think you don't need to construct a new distribution every time. Instead, you can reset the parameters of the existing one:
dist.param({ wts.begin(), wts.end() });
The complete code is as follows:
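// Improved approach using the standard library
#include <algorithm>
#include <iostream>
#include <vector>
#include <random>
#include <iomanip>
#include <iterator>
#include <map>
#include <set>
int main()
{
// Default random number engine (not used below)
std::default_random_engine rng;
// Engine seeded from the device entropy source to ensure randomness
auto gen = std::mt19937{ std::random_device{}() };
std::vector<int> wts(24); // stores the weights
std::vector<int> in(24); // stores the population
std::set<int> out; // stores the sampling result
std::map<int, int> count; // per-index counts for the output
int sampleCount = 0; // number of draws made
int index = 0; // index that was drawn
int sampleSize = 24; // number of samples to draw
int sampleTimes = 100000; // number of draws in the frequency test
// Assign the weights
for (int i = 0; i < 24; i++)
{
wts.at(i) = 48 - 2 * i;
}
// Fill the population and print it
std::cout << "The population has 24 items:" << std::endl;
// Assign the values
for (int i = 0; i < 24; i++)
{
in.at(i) = i + 1;
std::cout << in.at(i) << " ";
}
std::cout << std::endl;
// Create a discrete distribution from the given weights
std::discrete_distribution<size_t> dist{ wts.begin(), wts.end() };
auto probs = dist.probabilities(); // returns the computed probabilities
// Print the computed probabilities
std::cout << "Weights of the items in the population:" << std::endl;
std::copy(probs.begin(), probs.end(), std::ostream_iterator<double>
{ std::cout << std::fixed << std::setprecision(5), " "});
std::cout << std::endl << std::endl;
//========== Sampling test ==========
for (size_t j = 0; j < sampleTimes; j++)
{
index = dist(gen);
//std::cout << index << " "; // print the drawn index
count[index] += 1; // count the drawn index
}
double sum = 0.0; // used to sum the frequencies
// Print the sampling result
std::cout << "Total of " << sampleTimes << " draws; " << "count and frequency of each index:" << std::endl;
for (size_t i = 0; i < 24; i++)
{
std::cout << i << ": count " << count[i] << ", frequency: " << count[i] / double(sampleTimes) << std::endl;
sum += count[i] / double(sampleTimes);
}
std::cout << "Total frequency: " << sum << std::endl << std::endl; // print the total probability
//========== Sampling test ==========
// Draw from the population into the set until the set size reaches the sample size
while (out.size() < sampleSize - 1)
{
index = dist(gen); // draw an index
out.insert(index); // insert it into the set
sampleCount += 1; // increase the number of draws by 1
wts.at(index) = 0; // set the weight of the drawn index to 0
dist.param({ wts.begin(), wts.end() });
probs = dist.probabilities(); // returns the computed probabilities
// Print the computed probabilities
std::cout << "Weights of the items in the population:" << std::endl;
std::copy(probs.begin(), probs.end(), std::ostream_iterator<double>
{ std::cout << std::fixed << std::setprecision(5), " "});
std::cout << std::endl << std::endl;
}
// The last draw is done separately so that a weight array whose entries are all 0
// is never assigned to the discrete distribution dist, which would cause an error
index = dist(gen); // draw an index
out.insert(index); // insert it into the set
sampleCount += 1; // increase the number of draws by 1
// Print the sampling result
std::cout << "Indices of the " << sampleSize << " samples drawn from the population:" << std::endl;
for (auto iter : out)
{
std::cout << iter << "-";
}
std::cout << std::endl;
// Print the number of draws
std::cout << "Number of draws: " << sampleCount << std::endl;
out.clear(); // clear the output set in preparation for the next sampling
std::cin.get(); // keep the console window open
return 0;
}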
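Stripped of the printing, the param() idea can be sketched as a small function. This is only a sketch under assumptions: the helper name sampleByResettingParams is made up, and it assumes at least k weights are strictly positive. As far as I can tell, re-parameterizing still recomputes the probability table on every draw, so it avoids constructing a new object but probably not the O(n) work per draw.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (name is an assumption for this sketch).
// Draws k distinct indices by zeroing the weight of each drawn index and
// re-parameterizing the same distribution object.
// Takes the weights by value because they are modified along the way.
// Assumes at least k of the weights are strictly positive.
template <class Generator>
std::vector<int> sampleByResettingParams(std::vector<double> weights,
                                         std::size_t k, Generator& gen)
{
    assert(k <= weights.size());
    typedef std::discrete_distribution<int>::param_type Params;
    std::discrete_distribution<int> dist(weights.begin(), weights.end());

    std::vector<int> sample;
    sample.reserve(k);
    for (std::size_t i = 0; i < k; ++i) {
        const int index = dist(gen);
        sample.push_back(index);
        weights[index] = 0.0;  // this index can no longer be drawn
        // Skip the update after the last draw: by then all weights may be 0.
        if (i + 1 < k) {
            dist.param(Params(weights.begin(), weights.end()));
        }
    }
    return sample;
}
```

Unlike the std::set in the program above, the returned vector keeps the order in which the indices were drawn.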

Related

Fastest way to remove items from Boost Rtree

I've tested the code from [1] to remove items from a Boost Rtree. However, it is quite slow. It takes about 1s to insert 1M records and 12s to remove them.
Can it be done faster?
Here is the test code:
#include <vector>
#include <iostream>
#include <stdio.h>
#include <boost/geometry.hpp>
#include <boost/geometry/index/rtree.hpp>
#include <boost/timer.hpp>
struct Rect
{
Rect() {}
Rect(int a_minX, int a_minY, int a_maxX, int a_maxY)
{
min[0] = a_minX;
min[1] = a_minY;
max[0] = a_maxX;
max[1] = a_maxY;
}
int min[2];
int max[2];
};
int main()
{
// randomize rectangles
std::vector<Rect> rects;
for (size_t i = 0 ; i < 1000000 ; ++i)
{
int min_x = rand() % 100000;
int min_y = rand() % 100000;
int w = 1 + rand() % 100;
int h = 1 + rand() % 100;
rects.push_back(Rect(min_x, min_y, min_x+w, min_y+h));
}
// create the rectangle passed into the query
Rect search_rect(4, 4, 6, 6);
// create the Boost.Geometry R-tree
namespace bg = boost::geometry;
namespace bgi = boost::geometry::index;
typedef bg::model::point<double, 2, bg::cs::cartesian> point_t;
typedef bg::model::box<point_t> box_t;
typedef std::pair<box_t, uint64_t> value_t;
bgi::rtree<value_t, bgi::quadratic<8,4> > bg_tree;
{
boost::timer t;
for(size_t i = 0; i < rects.size(); i++)
{
Rect const& r = rects[i];
box_t b(point_t(r.min[0], r.min[1]), point_t(r.max[0], r.max[1]));
bg_tree.insert(value_t(b, i));
}
double s = t.elapsed();
std::cout << s << " Boost insert time" << std::endl;
}
// test BG Rtree
{
std::vector<value_t> res;
box_t search_box(
point_t(search_rect.min[0], search_rect.min[1]),
point_t(search_rect.max[0], search_rect.max[1]));
size_t sum = 0;
boost::timer t;
for (size_t i = 0 ; i < 10000 ; ++i)
{
res.clear();
sum += bg_tree.query(bgi::intersects(search_box), std::back_inserter(res));
}
double s = t.elapsed();
std::cout << s << " Boost query " << sum << std::endl;
}
{
boost::timer t;
std::cout << "Tree contains " << bg_tree.size() << " box-id values." << std::endl;
while (!bg_tree.empty()) {
// 1. Choose arbitrary BoxIdPair to be the leader of a new canopy.
// Remove it from the tree. Insert it into the canopy map, with its
// corresponding id.
Rect const& r = rects[rects.size()-1];
point_t origin(r.min[0], r.min[0]);
rects.pop_back();
auto first = bgi::qbegin(bg_tree, bgi::nearest(origin, 1)),
last = bgi::qend(bg_tree);
if (first != last) {
bg_tree.remove(*first); // assuming single result
}
}
double s = t.elapsed();
std::cout << s << " Boost tree emptied " << std::endl;
}
}
Example output:
1.09421 Boost insert time
0.000371 Boost query 0
Tree contains 1000000 box-id values.
12.7237 Boost tree emptied
[1] Issue with removing points from a boost::geometry::index::rtree

Convert vector<int> to integer

I was looking for a pre-defined function for converting a vector of integers into a normal integer, but I didn't find one.
vector<int> v;
v.push_back(1);
v.push_back(2);
v.push_back(3);
Need this:
int i=123; //directly converted from vector to int
Is there a possible way to achieve this?
If elements of vector are digits:
int result = 0;
for (auto d : v)
{
result = result * 10 + d;
}
If not digits:
stringstream str;
copy(v.begin(), v.end(), ostream_iterator<int>(str, ""));
int res = stoi(str.str());
Using C++ 11:
reverse(v.begin(), v.end());
int decimal = 1;
int total = 0;
for (auto& it : v)
{
total += it * decimal;
decimal *= 10;
}
EDIT: Now it should be the right way.
EDIT 2: See DAle's answer for a shorter/simpler one.
Wrapped into a function to make it reusable (thanks @Samer):
int VectorToInt(vector<int> v)
{
reverse(v.begin(), v.end());
int decimal = 1;
int total = 0;
for (auto& it : v)
{
total += it * decimal;
decimal *= 10;
}
return total;
}
One liner with C++11 using std::accumulate():
auto rz = std::accumulate( v.begin(), v.end(), 0, []( int l, int r ) {
return l * 10 + r;
} );
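For completeness, here is the accumulate approach as a self-contained function, since the one-liner does not show that std::accumulate lives in <numeric>. It is a small sketch: the name digitsToInt is made up, every element is assumed to be a single digit 0..9, and nothing guards against overflow of int.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper (name is an assumption for this sketch).
// Assumes each element is a single decimal digit and that the result
// fits in an int; no overflow check is performed.
int digitsToInt(const std::vector<int>& digits)
{
    return std::accumulate(digits.begin(), digits.end(), 0,
                           [](int acc, int digit) {
                               assert(digit >= 0 && digit <= 9);
                               return acc * 10 + digit;  // shift left one decimal place
                           });
}
```

An empty vector yields 0, and leading zeros are simply absorbed.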
In conjunction with the answer provided by deepmax in this post Converting integer into array of digits and the answers provided by multiple users in this post, here is a complete test program with a function to convert an integer to a vector and a function to convert a vector to an integer:
// VecToIntToVec.cpp
#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>
// function prototypes (must match the definitions below)
int vecToInt(std::vector<int> vec);
std::vector<int> intToVec(int num);
int main(void)
{
std::vector<int> vec = { 3, 4, 2, 5, 8, 6 };
int num = vecToInt(vec);
std::cout << "num = " << num << "\n\n";
vec = intToVec(num);
for (auto &element : vec)
{
std::cout << element << ", ";
}
return(0);
}
int vecToInt(std::vector<int> vec)
{
std::reverse(vec.begin(), vec.end());
int result = 0;
int place = 1; // 1, 10, 100, ... (integer arithmetic avoids rounding in pow)
for (std::size_t i = 0; i < vec.size(); i++)
{
result += place * vec[i];
place *= 10;
}
return(result);
}
std::vector<int> intToVec(int num)
{
std::vector<int> vec;
if (num <= 0) return vec;
while (num > 0)
{
vec.push_back(num % 10);
num = num / 10;
}
std::reverse(vec.begin(), vec.end());
return(vec);
}
Working solution for negative numbers too!
#include <iostream>
#include <vector>
using namespace std;
template <typename T> int sgn(T val) {
return (T(0) < val) - (val < T(0));
}
int vectorToInt(vector<int> v) {
int result = 0;
if(!v.size()) return result;
result = result * 10 + v[0];
for (size_t i = 1; i < v.size(); ++i) {
result = result * 10 + (v[i] * sgn(v[0]));
}
return result;
}
int main(void) {
vector<int> negative_value = {-1, 9, 9};
cout << vectorToInt(negative_value) << endl;
vector<int> zero = {0};
cout << vectorToInt(zero) << endl;
vector<int> positive_value = {1, 4, 5, 3};
cout << vectorToInt(positive_value) << endl;
return 0;
}
Output:
-199
0
1453
The other answers (as of May '19) seem to assume positive integers only (maybe 0 too). I had negative inputs, thus, I extended their code to take into account the sign of the number as well.

Split an array at a specific value C++

Say I have an array like this:
int arr [9] = {2,1,5,8,9,4,10,15,20};
How can you split the array at a certain value threshold? So say int 8 is our splitting value; the end result would be two separate arrays (or a 2d array if you want to give that a shot) that in this example would be arr1 [4] = {1,2,4,5} and arr2 [5] = {8,9,10,15,20}. arr1 stores all the values in arr that are below 8, and arr2 stores all the values in arr that are 8 and above.
I haven't been able to locate sufficient documentation or examples of this being done and I think array manipulation and splitting is worth having examples of.
Use std::partition, or if you want to maintain the relative order and not sort the data, std::stable_partition.
#include <algorithm>
#include <iostream>
#include <vector>
int main()
{
int pivot = 8;
int arr [9] = {2,1,5,8,9,4,10,15,20};
// get partition point
int *pt = std::stable_partition(arr, std::end(arr), [&](int n) {return n < pivot;});
// create two vectors consisting of left and right hand side
// of partition
std::vector<int> a1(arr, pt);
std::vector<int> a2(pt, std::end(arr));
// output results
for (auto& i : a1)
std::cout << i << " ";
std::cout << '\n';
for (auto& i : a2)
std::cout << i << " ";
}
If you can use C++11 then this is one way of using the standard library:
Using a partition_point: (edited the example from the link)
#include <algorithm>
#include <array>
#include <iostream>
#include <iterator>
#include <vector>
int main()
{
std::array<int, 9> v = {2,1,5,8,9,4,10,15,20};
auto is_lower_than_8 = [](int i){ return i < 8; };
std::partition(v.begin(), v.end(), is_lower_than_8 );
auto p = std::partition_point(v.begin(), v.end(), is_lower_than_8 );
std::cout << "Before partition:\n ";
std::vector<int> p1(v.begin(), p);
std::sort(p1.begin(), p1.end());
std::copy(p1.begin(), p1.end(), std::ostream_iterator<int>(std::cout, " "));
std::cout << "\nAfter partition:\n ";
std::vector<int> p2(p, v.end());
std::sort(p2.begin(), p2.end());
std::copy(p2.begin(), p2.end(), std::ostream_iterator<int>(std::cout, " "));
}
Which prints:
Before partition:
1 2 4 5
After partition:
8 9 10 15 20
I'm working on a solution with loops. This is a work in progress. Let me know what you think.
void splitarr(int arr[], int length) {
int accu = 0;
int accu2 = 0;
int splitter = rand() % 20;
for (int i = 0; i < length; i++) {
if (i != splitter) {
accu++;
}
}
int arr1[accu];
for (int i = 0; i < length; i++) {
if (i != splitter) {
arr1[i] = i;
}
}
for (int i = 0; i < length; i++) {
if (i == splitter) {
accu2++;
}
}
int arr2[accu2];
for (int i = 0; i < length; i++) {
if (i == splitter) {
arr2[i] = i;
}
}
}
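A loop-based split that compares the element values (rather than the indices) against the threshold could look like the sketch below. The name splitAtThreshold is made up, and it returns two std::vector objects instead of variable-length arrays, which are not standard C++.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (name is an assumption for this sketch).
// Returns {values below threshold, values at or above threshold},
// each in the order in which they appeared in the input.
std::pair<std::vector<int>, std::vector<int>>
splitAtThreshold(const int* arr, std::size_t length, int threshold)
{
    assert(arr != nullptr || length == 0);
    std::vector<int> below;
    std::vector<int> atOrAbove;
    for (std::size_t i = 0; i < length; ++i) {
        if (arr[i] < threshold) {
            below.push_back(arr[i]);
        } else {
            atOrAbove.push_back(arr[i]);
        }
    }
    return std::make_pair(std::move(below), std::move(atOrAbove));
}
```

For the array from the question and a threshold of 8 this gives {2,1,5,4} and {8,9,10,15,20}; sort each half afterwards if you want the sorted output shown in the question.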

Fastest way to determine whether elements of a vector y occur in a vector x

I have the following problem: I have two vectors x and y of type double that are increasingly sorted and I would like to obtain a vector z indicating whether an element of y is present in x. Up to now, I have used std::binary_search in a for-loop as illustrated below, but I think there should be a faster way making use of the fact that also x is sorted?
The issue is that this needs to be super fast as it turns out to be the bottleneck in my code.
For those familiar with R, I need an equivalent to match(y, x, nomatch = 0L) > 0L.
#include <iostream>
#include <algorithm>
#include <vector>
int main() {
using namespace std;
vector<double> x = {1.8, 2.4, 3.3, 4.2, 5.6,7.9, 8.5, 9.3};
vector<double> y = {0.5, 0.98, 1.8, 3.1, 5.6, 6.6, 9.3, 9.3, 9.5};
vector<bool> z(y.size());
for (int i = 0; i != y.size(); ++i)
z[i] = binary_search(x.begin(), x.end(), y[i]);
for (vector<bool>::const_iterator i = z.begin(); i != z.end(); ++i)
cout << *i << " ";
return 0;
}
EDIT
Here are representative sample data for my problem:
#include <iostream>
#include <algorithm>
#include <vector>
#include <cstdlib>
#include <ctime>
// function generator:
double RandomNumber () { return (std::rand() / 10e+7); }
int main() {
using namespace std;
std::srand ( unsigned ( std::time(0) ) );
// 5000 is representative
int n = 5000;
std::vector<double> x (n);
std::generate (x.begin(), x.end(), RandomNumber);
std::vector<double> y (n);
std::generate (y.begin(), y.end(), RandomNumber);
for(std::vector<double>::const_iterator i = x.begin(); i != x.end(); i++) {
y.push_back(*i);
}
std::sort(x.begin(), x.end());
std::sort(y.begin(), y.end());
return 0;
}
You can use std::set_intersection:
#include <vector>
#include <algorithm>
#include <iterator>
#include <iostream>
int main()
{
std::vector<double> x {1.8, 2.4, 3.3, 4.2, 5.6,7.9, 8.5, 9.3};
std::vector<double> y {0.5, 0.98, 1.8, 3.1, 5.6, 6.6, 9.3, 9.3, 9.5};
std::vector<double> z {};
std::set_intersection(std::cbegin(x), std::cend(x),
std::cbegin(y), std::cend(y),
std::back_inserter(z));
std::copy(std::cbegin(z), std::cend(z),
std::ostream_iterator<double> {std::cout, " "});
}
Edit
To address Dieter Lücking's point in the comments, here is a version that more closely matches R's match function:
#include <vector>
#include <deque>
#include <algorithm>
#include <iterator>
#include <functional>
#include <memory>
#include <iostream>
template <typename T>
std::deque<bool> match(const std::vector<T>& y, const std::vector<T>& x)
{
std::vector<std::reference_wrapper<const T>> z {};
z.reserve(std::min(y.size(), x.size()));
std::set_intersection(std::cbegin(y), std::cend(y),
std::cbegin(x), std::cend(x),
std::back_inserter(z));
std::deque<bool> result(y.size(), false);
for (const auto& e : z) {
result[std::distance(std::addressof(y.front()), std::addressof(e.get()))] = true;
}
return result;
}
int main()
{
std::vector<double> x {1.8, 2.4, 3.3, 4.2, 5.6,7.9, 8.5, 9.3};
std::vector<double> y {0.5, 0.98, 1.8, 3.1, 5.6, 6.6, 9.3, 9.3, 9.5};
const auto matches = match(y, x);
std::copy(std::cbegin(matches), std::cend(matches),
std::ostream_iterator<bool> {std::cout});
}
I took all of your code, Dieter's timing sample, and the OP's sample data of 5000 random doubles to perform a more complete timing of all the alternatives. This is the code:
#include <chrono>
#include <iostream>
#include <algorithm>
#include <vector>
#include <iterator>
#include <cstdlib>
#include <ctime>
#include <assert.h>
#include <deque>
#include <functional>
#include <memory>
using namespace std;
double RandomNumber () { return (std::rand() / 10e+7); }
template <typename T>
std::deque<bool> match(const std::vector<T>& y, const std::vector<T>& x)
{
std::vector<std::reference_wrapper<const T>> z {};
z.reserve(std::min(y.size(), x.size()));
std::set_intersection(y.cbegin(), y.cend(),
x.cbegin(), x.cend(),
std::back_inserter(z));
std::deque<bool> result(y.size(), false);
for (const auto& e : z) {
result[std::distance(std::addressof(y.front()), std::addressof(e.get()))] = true;
}
return result;
}
int main() {
const int NTESTS = 10;
long long time1 = 0;
long long time2 = 0;
long long time3 = 0;
long long time3_prime = 0;
long long time4 = 0;
long long time5 = 0;
long long time6 = 0;
for (int i = 0; i < NTESTS; ++i){
std::srand ( unsigned ( std::time(0) ) );
// 5000 is representative
int n = 5000;
std::vector<double> x (n);
std::generate (x.begin(), x.end(), RandomNumber);
std::vector<double> y (n);
std::generate (y.begin(), y.end(), RandomNumber);
for(std::vector<double>::const_iterator i = x.begin(); i != x.end(); i++) {
y.push_back(*i);
}
std::sort(x.begin(), x.end());
std::sort(y.begin(), y.end());
vector<bool> z1(y.size());
vector<unsigned char> z2(y.size());
vector<unsigned char> z3(y.size());
std::deque<bool> z3_prime;
vector<bool> z4(y.size());
std::vector<bool> z5(y.size());
std::vector<bool> z6(y.size());
// Original
{
auto start = std::chrono::high_resolution_clock::now();
for (size_t i = 0; i != y.size(); ++i) {
z1[i] = binary_search(x.begin(), x.end(), y[i]);
}
auto stop = std::chrono::high_resolution_clock::now();
auto duration = chrono::duration_cast<chrono::nanoseconds>(stop - start);
time1 += duration.count();
}
// Original (replacing vector<bool> by vector<unsigned char>)
{
auto start = std::chrono::high_resolution_clock::now();
for (size_t i = 0; i != y.size(); ++i) {
z2[i] = binary_search(x.begin(), x.end(), y[i]);
}
auto stop = std::chrono::high_resolution_clock::now();
auto duration = chrono::duration_cast<chrono::nanoseconds>(stop - start);
time2 += duration.count();
}
{ // Dieter Lücking set_intersection
auto start = std::chrono::high_resolution_clock::now();
size_t ix = 0;
size_t iy = 0;
while(ix < x.size() && iy < y.size())
{
if(x[ix] < y[iy]) ++ix;
else if(y[iy] < x[ix]) ++iy;
else {
z3[iy] = 1;
// ++ix; Not this if one vector is not uniquely sorted
++iy;
}
}
auto stop = std::chrono::high_resolution_clock::now();
auto duration = chrono::duration_cast<chrono::nanoseconds>(stop - start);
time3 += duration.count();
}
// Std::set_intersection
{
auto start = std::chrono::high_resolution_clock::now();
z3_prime = match(y, x);
auto stop = std::chrono::high_resolution_clock::now();
auto duration = chrono::duration_cast<chrono::nanoseconds>(stop - start);
time3_prime += duration.count();
}
{ // Ed Heal
auto start = std::chrono::high_resolution_clock::now();
int i_x = 0, i_y = 0;
while (i_x < x.size() && i_y < y.size())
{
if (x[i_x] == y[i_y]) {
//cout << "In both" << x[i_x] << endl;
z4[i_y] = true;
++i_x;
++i_y;
} else if (x[i_x] < y[i_y]) {
++i_x;
} else {
z4[i_y] = false;
++i_y;
}
}
/* for (; i_y < y.size(); ++i_y) {
//Empty
} */
auto stop = std::chrono::high_resolution_clock::now();
auto duration = chrono::duration_cast<chrono::nanoseconds>(stop - start);
time4 += duration.count();
}
{ // JacquesdeHooge
auto start = std::chrono::high_resolution_clock::now();
auto it_x = x.begin();
int i = 0;
for (; i < (int)y.size(); ++i) {
it_x = std::lower_bound(it_x, x.end(), y[i]);
if (it_x == x.end()) break;
z5[i] = *it_x == y[i];
}
std::fill(z5.begin() + i, z5.end(), false);
auto stop = std::chrono::high_resolution_clock::now();
auto duration = chrono::duration_cast<chrono::nanoseconds>(stop - start);
time5 += duration.count();
}
{ // Skizz
auto start = std::chrono::high_resolution_clock::now();
vector<double>::iterator a = x.begin(), b = y.begin();
int i = 0;
while (a != x.end () && b != y.end ())
{
if (*a == *b) {
z6[i] = true;
++a;
++b;
}
else
{
z6[i] = false;
if (*a < *b)
{
++a;
}
else
{
++b;
}
}
i++;
}
auto stop = std::chrono::high_resolution_clock::now();
auto duration = chrono::duration_cast<chrono::nanoseconds>(stop - start);
time6 += duration.count();
}
assert (std::equal(z1.begin(), z1.begin() + 5000, z2.begin()));
assert (std::equal(z1.begin(), z1.begin() + 5000, z3.begin()));
assert (std::equal(z1.begin(), z1.begin() + 5000, z3_prime.begin()));
assert (std::equal(z1.begin(), z1.begin() + 5000, z4.begin()));
assert (std::equal(z1.begin(), z1.begin() + 5000, z5.begin()));
assert (std::equal(z1.begin(), z1.begin() + 5000, z6.begin()));
}
cout << "Original - vector<bool>: \t\t" << time1 << " ns\n";
cout << "Original - vector<unsigned char>: \t" << time2 << " ns\n";
cout << "Set intersection (Daniel): \t\t" << time3_prime << " ns\n";
cout << "Set intersection (Dieter Lücking): \t" << time3 << " ns\n";
cout << "Ed Heal: \t\t\t\t" << time4 << " ns\n";
cout << "JackesdeHooge: \t\t\t\t" << time5 << " ns\n";
cout << "Skizz: \t\t\t\t\t" << time6 << " ns\n";
cout << endl;
return 0;
}
My results with g++ 5.2.1, -std=c++11 and -O3:
Original - vector<bool>: 10152069 ns
Original - vector<unsigned char>: 8686619 ns
Set intersection (Daniel): 1768855 ns
Set intersection (Dieter Lücking): 1617106 ns
Ed Heal: 1446596 ns
JackesdeHooge: 3998958 ns
Skizz: 1385193 ns
*Please note that the Ed Heal and Skizz solutions are essentially the same.
Since both vectors are sorted, you have to apply binary search only to the remaining part of the second vector.
So if, for example, you don't find x[i] before y[j], you're certain you also won't find x[i + 1] before y[j]. To find a match for x[i + 1], it therefore suffices to apply binary search starting at y[j].
Off the top of my head, I can only think of this:-
vector<double>::iterator a = x.begin(), b = y.begin();
while (a != x.end () && b != y.end ())
{
if (*a == *b)
{
// value is in both containers
++a;
}
else
{
if (*a < *b)
{
++a;
}
else
{
++b;
}
}
}
Perhaps this algorithm will be better as the two vectors are sorted. The time complexity is linear.
#include <iostream>
#include <algorithm>
#include <vector>
int main() {
using namespace std;
vector<double> x = {1.8, 2.4, 3.3, 4.2, 5.6,7.9, 8.5, 9.3};
vector<double> y = {0.5, 0.98, 1.8, 3.1, 5.6, 6.6, 9.3, 9.3, 9.5};
vector<bool> z(y.size());
int i_x = 0, i_y = 0;
while (i_x < x.size() && i_y < y.size())
{
if (x[i_x] == y[i_y]) {
cout << "In both" << x[i_x] << endl;
z[i_y] = true;
++i_x;
++i_y;
} else if (x[i_x] < y[i_y]) {
++i_x;
} else {
z[i_y] = false;
++i_y;
}
}
for (; i_y < y.size(); ++i_y) {
//Empty
}
for (vector<bool>::const_iterator i = z.begin(); i != z.end(); ++i)
cout << *i << " ";
return 0;
}
An implementation of @JacquesdeHooge's answer:
std::vector<bool> ComputeMatchFlags(const std::vector<double>& x,
const std::vector<double>& y) {
std::vector<bool> found(y.size());
auto it_x = x.begin();
int i = 0;
for (; i < (int)y.size(); ++i) {
it_x = std::lower_bound(it_x, x.end(), y[i]);
if (it_x == x.end()) break;
found[i] = *it_x == y[i];
}
std::fill(found.begin() + i, found.end(), false);
return found;
}
When you have found an element (or a place in the array the element would have been), you don't need to consider elements that occur before that any more. So use the result of the previous find instead of x.begin().
Since std::binary_search does not return an iterator, use std::lower_bound instead. Also consider std::find (yes linear search, it might be actually faster, depending on your data).
If this doesn't bring enough improvement, try std::unordered_set instead of an array.
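The std::unordered_set suggestion could be sketched roughly as follows. The helper name matchWithHashSet is made up, and whether this beats the linear merge on sorted input is something you would have to measure; it mainly pays off when x does not have to be kept sorted.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical helper (name is an assumption for this sketch).
// z[i] is true when y[i] also occurs in x. Uses exact floating-point
// equality, like std::binary_search in the question does.
std::vector<bool> matchWithHashSet(const std::vector<double>& x,
                                   const std::vector<double>& y)
{
    // Build the lookup table once; each query is then O(1) on average.
    const std::unordered_set<double> lookup(x.begin(), x.end());
    std::vector<bool> z(y.size());
    for (std::size_t i = 0; i < y.size(); ++i) {
        z[i] = lookup.count(y[i]) > 0;
    }
    return z;
}
```

With the x and y from the question this produces 0 0 1 0 1 0 1 1 0, the same flags as the binary_search loop.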
Just a timing of binary search and set intersection, with the improvement of using std::vector<unsigned char>:
#include <chrono>
#include <iostream>
#include <algorithm>
#include <vector>
int main() {
using namespace std;
// Original
{
vector<double> x = {1.8, 2.4, 3.3, 4.2, 5.6,7.9, 8.5, 9.3};
vector<double> y = {0.5, 0.98, 1.8, 3.1, 5.6, 6.6, 9.3, 9.3, 9.5};
auto start = std::chrono::high_resolution_clock::now();
vector<bool> z(y.size());
for (size_t i = 0; i != y.size(); ++i)
z[i] = binary_search(x.begin(), x.end(), y[i]);
auto stop = std::chrono::high_resolution_clock::now();
auto duration = chrono::duration_cast<chrono::nanoseconds>(stop - start);
cout << "vector<bool>: " << duration.count() << "ns\n";
for (auto i = z.begin(); i != z.end(); ++i)
cout << unsigned(*i) << " ";
cout << '\n';
}
// Original (replacing vector<bool> by vector<unsigned char>)
{
vector<double> x = {1.8, 2.4, 3.3, 4.2, 5.6,7.9, 8.5, 9.3};
vector<double> y = {0.5, 0.98, 1.8, 3.1, 5.6, 6.6, 9.3, 9.3, 9.5};
auto start = std::chrono::high_resolution_clock::now();
vector<unsigned char> z(y.size());
for (size_t i = 0; i != y.size(); ++i)
z[i] = binary_search(x.begin(), x.end(), y[i]);
auto stop = std::chrono::high_resolution_clock::now();
auto duration = chrono::duration_cast<chrono::nanoseconds>(stop - start);
cout << "vector<unsigned char>: " << duration.count() << "ns\n";
for (auto i = z.begin(); i != z.end(); ++i)
cout << unsigned(*i) << " ";
cout << '\n';
}
// Similar to std::set_intersection
{
vector<double> x = {1.8, 2.4, 3.3, 4.2, 5.6,7.9, 8.5, 9.3};
vector<double> y = {0.5, 0.98, 1.8, 3.1, 5.6, 6.6, 9.3, 9.3, 9.5};
auto start = std::chrono::high_resolution_clock::now();
vector<unsigned char> z(y.size());
size_t ix = 0;
size_t iy = 0;
while(ix < x.size() && iy < y.size())
{
if(x[ix] < y[iy]) ++ix;
else if(y[iy] < x[ix]) ++iy;
else {
z[iy] = 1;
// ++ix; Not this if one vector is not uniquely sorted
++iy;
}
}
auto stop = std::chrono::high_resolution_clock::now();
auto duration = chrono::duration_cast<chrono::nanoseconds>(stop - start);
cout << "set intersection: " << duration.count() << "ns\n";
for (auto i = z.begin(); i != z.end(); ++i)
cout << unsigned(*i) << " ";
cout << '\n';
}
return 0;
}
Compiled with g++ -std=c++11 -O3 (g++ 4.84) gives:
vector<bool>: 3622ns
0 0 1 0 1 0 1 1 0
vector<unsigned char>: 1635ns
0 0 1 0 1 0 1 1 0
set intersection: 1299ns
0 0 1 0 1 0 1 1 0

Mode of Array C++

My code to find the mode (most often) and how many times said mode was displayed runs into a never-ending loop. Does anyone know what I can do to fix it?
EDIT I UPDATED THE CODE: It returns 0, which is not the mode.
void calculateMode(int array[], int size)
{
int counter = 0;
int max = 0;
int mode = 0;
for (int pass = 0; pass < size - 1; pass++)
for (int count = pass + 1; count < size; count++) {
if (array[count] > max) {
max = array[count];
mode = 1;
counter = array[pass];
}
cout << "The mode is: " << counter "It's been displayed: " << count << "times" << endl;
}
A solution using map. Compile with g++ -std=c++11 a.cpp.
Here is definition of mode
#include <iostream>
#include <map>
#include <vector>
using namespace std;
int main()
{
vector<int> v = {1, 1, 2, 2, 3, 3};
map<int, int> count;
for (size_t i = 0; i < v.size(); ++i)
count[v[i]]++;
vector<int> mode;
int cnt = 0;
for (map<int, int>::iterator it = count.begin(); it != count.end(); ++it) {
if (it->second > cnt) {
mode.clear();
mode.push_back(it->first);
cnt = it->second;
} else if (it->second == cnt) {
mode.push_back(it->first);
}
}
if (mode.size() * cnt == v.size()) {
cout << "No mode" << endl;
} else {
cout << "mode:";
for (size_t i = 0; i < mode.size(); ++i)
cout << ' ' << mode[i];
cout << endl;
}
return 0;
}
This code uses "map" to find out the MODE from the given array. I hope this solution might help you.
#include <algorithm>
#include <map>
#include <utility>
using namespace std;
int findMode(int * arr, int size)
{
map<int, int> modeMap;
sort(arr, arr + size);
for (int i = 0; i < size; ++i) {
++modeMap[arr[i]];
}
auto x = std::max_element(modeMap.begin(), modeMap.end(),
[](const pair<int, int>& a, const pair<int, int>& b) {
return a.second < b.second; });
return x->first;
}
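Since the question also asks how many times the mode occurs, here is a sketch that returns both values. The name modeWithCount is made up; it assumes a non-empty array and, on ties, reports the smallest value because std::map iterates in key order.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>

// Hypothetical helper (name is an assumption for this sketch).
// Returns {mode, number of occurrences}. Assumes size > 0.
// On ties the smallest value wins, because the map is walked in key order
// and only a strictly larger count replaces the current best.
std::pair<int, int> modeWithCount(const int* arr, std::size_t size)
{
    assert(size > 0);
    std::map<int, int> counts;
    for (std::size_t i = 0; i < size; ++i) {
        ++counts[arr[i]];
    }
    int bestValue = counts.begin()->first;
    int bestCount = counts.begin()->second;
    for (std::map<int, int>::const_iterator it = counts.begin();
         it != counts.end(); ++it) {
        if (it->second > bestCount) {
            bestValue = it->first;
            bestCount = it->second;
        }
    }
    return std::make_pair(bestValue, bestCount);
}
```

The input does not need to be sorted first, since the map does the grouping.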