Iterating over a column of a multidimensional array with std::count_if()? - c++

I have a multidimensional array of type double (double someArray[10][20]). I'd like to:
a) use std::count_if() to iterate over a single column of that array, returning the number of values greater than some number
b) also require that the row index of that number is within a certain range.
I know the basics of using std::count_if (i.e. I know how to iterate over, say, some vector and return values greater than/less than/equal to some value, for example), but I'm not sure how to do this over a column of a multidimensional array, or how to check that the index of the element also satisfies some condition.

If you are willing to use boost::range, you can use the count_if with a stride count.
The reason why this will work is that an array, regardless of the number of dimensions, will store its data in contiguous memory, thus random-access iteration will work exactly as it would a one-dimensional array.
Thus the goal is to figure out the starting column (easy), and instead of iterating one element at a time forward as you would with the "normal" std::count_if, you want to iterate (in your case) 20 elements, since iterating that many will take you to the next row of the column you're interested in.
For a 2D array of M x N size, the starting and ending addresses you would use for the STL algorithm functions would be:
start: &array2D[0][0]
end (one item past the end): &array2D[M-1][N]
Given this information, here is an example using boost:
#include <boost/range/adaptor/strided.hpp>
#include <boost/range/algorithm/copy.hpp>
#include <boost/assign.hpp>
#include <boost/range/algorithm.hpp>
#include <algorithm>
#include <iostream>
#include <numeric>
int main()
{
using namespace boost::adaptors;
using namespace boost::assign;
// declare and fill in the array with numbers starting from 0
double someArray[10][20];
std::iota(&someArray[0][0], &someArray[9][20], 0.0);
// let's get the address of the start of the third column
const double* startAddr = &someArray[0][2];
// here is the address of the end of the 2-dimensional array
const double* endAddr = &someArray[9][20];
// create a SinglePass range consisting of the starting and ending address
// plus the stride count
auto str = std::make_pair(startAddr, endAddr) | strided(20);
// count how many items in the third column are less than 60
auto result = boost::range::count_if(str, [&](double val) { return val < 60; });
std::cout << result;
}
Output:
3
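If Boost is not available, the same stride idea can be sketched with plain pointer arithmetic. This is only an illustration that relies on the same contiguous-layout assumption as the Boost version; count_strided and its parameters are hypothetical names, not part of any library.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: visits `count` elements starting at `first`,
// stepping `stride` elements each time, and counts those satisfying `pred`.
template <class Pred>
std::size_t count_strided(const double* first, std::size_t count,
                          std::size_t stride, Pred pred)
{
    assert(stride > 0);
    std::size_t hits = 0;
    for (std::size_t k = 0; k < count; ++k) {
        if (pred(first[k * stride])) {
            ++hits;
        }
    }
    return hits;
}
```

For the array above, count_strided(&someArray[0][2], 10, 20, pred) would visit the ten entries of the third column. To restrict the row range, start at &someArray[from][2] and pass the number of rows in the range as the count.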

Divide and conquer.
Use count_if to iterate over a single column of a multidimensional array, checking that the index of the row is within a certain range.
Break the problem into smaller pieces, until they are small enough to solve.
Use count_if
iterate over a single column of a multidimensional array
checking that the index of the row is within a certain range
Hmm... that first task looks doable. Maybe the last. Back to division.
Use std::count_if
iterate:
over an array
but it's multidimensional
only one column matters
checking that the index of the row is within a certain range
Iterating over an array is not hard, given std::begin and std::end. The multidimensional aspect looks scary, so let's wait on that and see how far we can get. For now, just plug "iterate over an array" into "use count_if".
std::count_if(std::begin(someArray), std::end(someArray), [](auto & element)
{ return ???; });
Hmm... the name element is not accurate is it? When std::begin is applied to double [][20], only one dimension is consumed. The result of de-referencing is not double but double [20], so row would be a more accurate name.
std::count_if(std::begin(someArray), std::end(someArray), [](auto & row)
{ return ???; });
The multidimensional aspect might have just taken care of itself. Can we focus on just one column? Let's assume a variable named column is the desired column index, and target can be the "some number" from the problem description.
std::count_if(std::begin(someArray), std::end(someArray), [column, target](auto & row)
{ return row[column] > target; });
So that leaves restricting the row index to a certain range. That is, iterate over the restricted range instead of over the entire array. Looks like a job for std::next and std::prev. We'll just need two more variables to give us the range; let's use from and to.
// Check only the desired rows
auto start = std::next(std::begin(someArray), from);
auto stop = std::prev(std::end(someArray), 9 - to);
return std::count_if(start, stop, [column, target](auto & row)
{ return row[column] > target; });
(I don't like that magic number 9. Better would be ROWS-1, assuming suitable constants have been declared so that the array's declaration can become double someArray[ROWS][COLS].)
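Putting the pieces together, one possible complete version might look like the sketch below. ROWS, COLS and count_in_column are names assumed here for illustration, and to is treated as an inclusive row index, as in the snippet above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>

constexpr std::size_t ROWS = 10;  // assumed constants replacing the magic numbers
constexpr std::size_t COLS = 20;

// Counts the entries of `column` that are greater than `target`,
// looking only at rows from..to (both inclusive).
std::ptrdiff_t count_in_column(const double (&someArray)[ROWS][COLS],
                               std::size_t column, double target,
                               std::size_t from, std::size_t to)
{
    assert(column < COLS && from <= to && to < ROWS);
    // Check only the desired rows
    auto start = std::next(std::begin(someArray),
                           static_cast<std::ptrdiff_t>(from));
    auto stop = std::prev(std::end(someArray),
                          static_cast<std::ptrdiff_t>((ROWS - 1) - to));
    return std::count_if(start, stop,
                         [column, target](const double (&row)[COLS]) {
                             return row[column] > target;
                         });
}
```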

Related

Most efficient way to find index of matching values in two sorted arrays using C++

I currently have a solution but I feel it's not as efficient as it could be to this problem, so I want to see if there is a faster method to this.
I have two arrays (std::vectors, for example). Both arrays contain only unique integer values that are sorted but sparse in value, i.e. 1,4,12,13... What I want to ask is whether there is a fast way to find the INDEX into one of the arrays where the values are the same. For example, array1 has values 1,4,12,13 and array2 has values 2,12,14,16. The first matching value index is 1 in array2. The index into the array is what is important, as I have other arrays that contain data that will use this index that "matches".
I am not confined to using arrays; maps are possible too. I am only comparing the two arrays once. They will not be reused again after the first matching pass. There can be a small to large number of values (300,000+) in either array, but they DO NOT always have the same number of values (that would make things much easier).
The worst case is a nested linear search, O(N^2). Using a map would give me a better O(log N) per lookup, but I would still have to convert an array into a map of value/index pairs.
What I currently have, to avoid any container type conversions, is this. Loop over the smaller of the two arrays. Compare the current element of the small array (array1) with the current element of the large array (array2). If the array1 element is larger than the array2 element, increment the index for array2 until it is no longer larger than the array1 element (while loop). Then, if the array1 element is smaller than the array2 element, go to the next loop iteration and begin again. Otherwise they must be equal, and I have the index into both arrays for the matching value.
So this loop is at best O(N) if all values have matches and at worst O(2N) if none match. I am wondering if there is something faster out there? It's hard to know for sure how often the two arrays will match, but I would say that most of the arrays will mostly have matches.
I hope I explained the problem well enough and I appreciate any feedback or tips on improving this.
Code example:
std::vector<int> array1 = {4,6,12,34};
std::vector<int> array2 = {1,3,6,34,40};
for(unsigned int i=0, z=0; i < array1.size(); i++)
{
int value1 = array1[i];
while(z < array2.size() && value1 > array2[z]) // check the bound before reading array2[z]
z++;
if (z >= array2.size())
break; // reached end of array2
if (value1 < array2[z])
continue;
// we have a match, i and z indices have same value
}
Result will be matching indexes for array1 = [1,3] and for array2= [2,3]
I wrote an implementation of this function using an algorithm that performs better with sparse distributions than the trivial linear merge.
For distributions that are similar†, it has O(n) complexity, but for ranges where the distributions are greatly different it should perform below linear, approaching O(log n) in optimal cases. However, I wasn't able to prove that the worst case is any better than O(n log n). On the other hand, I haven't been able to find that worst case either.
I templated it so that any type of ranges can be used, such as sub-ranges or raw arrays. Technically it works with non-random access iterators as well, but the complexity is much greater, so it's not recommended. I think it should be possible to modify the algorithm to fall back to linear search in that case, but I haven't bothered.
† By similar distribution, I mean that the pair of arrays have many crossings. By crossing, I mean a point where you would switch from one array to another if you were to merge the two arrays together in sorted order.
#include <algorithm>
#include <iterator>
#include <utility>
// helper structure for the search
template<class Range, class Out>
struct search_data {
// is there any clearer way to get an iterator that might be either
// a Range::const_iterator or const T*?
using iterator = decltype(std::cbegin(std::declval<Range&>()));
iterator curr;
const iterator begin, end;
Out out;
};
template<class Range, class Out>
auto init_search_data(const Range& range, Out out) {
return search_data<Range, Out>{
std::begin(range),
std::begin(range),
std::end(range),
out,
};
}
template<class Range, class Out1, class Out2>
void match_indices(const Range& in1, const Range& in2, Out1 out1, Out2 out2) {
auto search_data1 = init_search_data(in1, out1);
auto search_data2 = init_search_data(in2, out2);
// initial order is arbitrary
auto lesser = &search_data1;
auto greater = &search_data2;
// if either range is exhausted, we are finished
while(lesser->curr != lesser->end
&& greater->curr != greater->end) {
// difference of first values in each range
auto delta = *greater->curr - *lesser->curr;
if(!delta) { // matching value was found
// store both results and increment the iterators
*lesser->out++ = std::distance(lesser->begin, lesser->curr++);
*greater->out++ = std::distance(greater->begin, greater->curr++);
continue; // then start a new iteration
}
if(delta < 0) { // set the order of ranges by their first value
std::swap(lesser, greater);
delta = -delta; // delta is always positive after this
}
// next crossing cannot be farther than the delta
// this assumption has following pre-requisites:
// range is sorted, values are integers, values in the range are unique
auto range_left = std::distance(lesser->curr, lesser->end);
auto upper_limit =
std::min(range_left, static_cast<decltype(range_left)>(delta));
// exponential search for a sub range where the value at upper bound
// is greater than target, and value at lower bound is lesser
auto target = *greater->curr;
auto lower = lesser->curr;
auto upper = std::next(lower, upper_limit);
for(int i = 1; i < upper_limit; i *= 2) {
auto guess = std::next(lower, i);
if(*guess >= target) {
upper = guess;
break;
}
lower = guess;
}
// skip all values in lesser,
// that are less than the least value in greater
lesser->curr = std::lower_bound(lower, upper, target);
}
}
#include <iostream>
#include <vector>
int main() {
std::vector<int> array1 = {4,6,12,34};
std::vector<int> array2 = {1,3,6,34};
std::vector<std::size_t> indices1;
std::vector<std::size_t> indices2;
match_indices(array1, array2,
std::back_inserter(indices1),
std::back_inserter(indices2));
std::cout << "indices in array1: ";
for(std::vector<int>::size_type i : indices1)
std::cout << i << ' ';
std::cout << "\nindices in array2: ";
for(std::vector<int>::size_type i : indices2)
std::cout << i << ' ';
std::cout << std::endl;
}
Since the arrays are already sorted you can just use something very much like the merge step of mergesort. This just looks at the head element of each array, and discards the lower element (the next element becomes the head). Stop when you find a match (or when either array becomes exhausted, indicating no match).
This is O(n) and the fastest you can do for arbitrary distributions. With certain clustered distributions a "skip ahead" approach could be used rather than always looking at the next element. This could result in better than O(n) running times for certain distributions. For example, given the arrays 1,2,3,4,5 and 10,11,12,13,14 an algorithm could determine there were no matches to be found in as few as one comparison (5 < 10).
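The merge-style walk described above could be sketched as follows. It collects every matching pair of indices rather than stopping at the first one; matching_indices is a name chosen here for illustration, and the function assumes both inputs are sorted in ascending order with unique values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Returns (index in a, index in b) for every value present in both vectors.
// Assumption: a and b are sorted ascending and contain no duplicates.
std::vector<std::pair<std::size_t, std::size_t>>
matching_indices(const std::vector<int>& a, const std::vector<int>& b)
{
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));
    std::vector<std::pair<std::size_t, std::size_t>> matches;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) {
            ++i;                       // discard the lower head element
        } else if (b[j] < a[i]) {
            ++j;
        } else {
            matches.emplace_back(i, j);  // heads are equal: record both indices
            ++i;
            ++j;
        }
    }
    return matches;
}
```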
What is the range of the stored numbers?
I mean, you say that the numbers are integers, sorted, and sparse (i.e. non-sequential), and that there may be more than 300,000 of them, but what is their actual range?
The reason that I ask is that, if there is a reasonably small upper limit, u, (say, u=500,000), the fastest and most expedient solution might be to just use the values as indices. Yes, you might be wasting memory, but is 4*u bytes really a lot of memory? This depends on your application and your target platform (i.e. if this is for a memory-constrained embedded system, it's less likely to be a good idea than if you have a laptop with 32GiB RAM).
Of course, if the values are more-or-less evenly spread over 0 to 2^31-1, this crude idea isn't attractive, but maybe there are properties of the input values that you can exploit other than simply the range. You might be able to hand-write a fairly simple hash function.
Another thing worth considering is whether you actually need to be able to retrieve the index quickly, or whether it is enough to be able to tell quickly if the value exists in the other array. Whether or not a value exists requires only one bit, so you could have a bitmap of the range of the input values using 32x less memory (i.e. mask off the 5 LSBs and use that as a bit position, then shift the remaining 27 bits 5 places right and use that as an array index).
Finally, a hybrid approach might be worth considering, where you decide how much memory you're prepared to use (say you decide 256KiB, which corresponds to 64Ki 4-byte integers) and then use that as a lookup table into much smaller sub-problems. Say you have 300,000 values whose LSBs are pretty evenly distributed. Then you could use the 16 LSBs as indices into a lookup table of lists that are (on average) only 4 or 5 elements long, which you can then search by other means. A couple of years ago, I worked on some simulation software that had ~200,000,000 cells, each with a cell id; some utility functionality used a binary search to identify cells by id. We were able to speed it up significantly and non-intrusively with this strategy. Not a perfect solution, but a great improvement. (If the LSBs are not evenly distributed, maybe that's a property that you can exploit, or maybe you can choose a range of bits that are, or do a bit of hashing.)
I guess the upshot is “consider some kind of hashing”, even the “identity hash” or simple masking/modulo with a little “your solution doesn't have to be perfectly general” on the side and some “your solution doesn't have to be perfectly space efficient” sauce on top.
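As a rough illustration of the "use the values as indices" idea, the sketch below builds a table of 4-byte entries indexed by value and then scans the other array once. The name match_by_table and the upper parameter are assumptions made for this example; it only applies when every value is known to lie in [0, upper) and the array length fits in a 32-bit integer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch: all values in a and b must lie in [0, upper).
// Uses 4 * upper bytes for the table, as discussed above.
std::vector<std::pair<std::size_t, std::size_t>>
match_by_table(const std::vector<int>& a, const std::vector<int>& b, int upper)
{
    assert(upper >= 0);
    const std::int32_t absent = -1;
    // position_in_b[v] is the index of value v in b, or `absent`.
    std::vector<std::int32_t> position_in_b(static_cast<std::size_t>(upper), absent);
    for (std::size_t j = 0; j < b.size(); ++j) {
        assert(b[j] >= 0 && b[j] < upper);
        position_in_b[static_cast<std::size_t>(b[j])] = static_cast<std::int32_t>(j);
    }
    std::vector<std::pair<std::size_t, std::size_t>> matches;
    for (std::size_t i = 0; i < a.size(); ++i) {
        assert(a[i] >= 0 && a[i] < upper);
        const std::int32_t j = position_in_b[static_cast<std::size_t>(a[i])];
        if (j != absent) {
            matches.emplace_back(i, static_cast<std::size_t>(j));
        }
    }
    return matches;
}
```

Unlike the merge-based approaches, this does not need either input to be sorted, at the cost of memory proportional to the value range.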

Is std::sort the best choice to do in-place sort for a huge array with limited integer value?

I want to sort an array with a huge number of elements (millions or even billions), where the values are integers within a small range (1 to 100 or 1 to 1000). In such a case, is std::sort, or the parallelized version __gnu_parallel::sort, the best choice for me?
Actually, I want to sort a vector of my own class, with an integer member representing the processor index.
As there are other members inside the class, even if two objects have the same integer member that is used for comparison, they might not be regarded as the same data.
Counting sort would be the right choice if you know that your range is so limited. If the range is [0,m), the most efficient way to do so is to have a vector in which the index represents the element and the value the count. For example:
vector<int> to_sort;
vector<int> counts;
for (int i : to_sort) {
    // grow whenever i is not yet a valid index (the size must be at least i+1)
    if (counts.size() <= static_cast<size_t>(i)) {
        counts.resize(i+1, 0);
    }
    counts[i]++;
}
Note that the count at i is lazily initialized but you can resize once if you know m.
If you are sorting objects by some field and they are all distinct, you can modify the above as:
vector<T> to_sort;
vector<vector<const T*>> count_sorted;
for (const T& t : to_sort) {
const int i = t.sort_field();
if (count_sorted.size() <= static_cast<size_t>(i)) {
count_sorted.resize(i+1, {});
}
count_sorted[i].push_back(&t);
}
Now the main difference is that your space requirements grow substantially because you need to store the vectors of pointers. The space complexity went from O(m) to O(n). Time complexity is the same. Note that the algorithm is stable. The code above assumes that to_sort is in scope during the life cycle of count_sorted. If your Ts implement move semantics you can store the object themselves and move them in. If you need count_sorted to outlive to_sort you will need to do so or make copies.
If you have a range of type [-l, m), the substance does not change much, but index i now represents the value i - l (that is, a value v is counted at index v + l), and you need to know l beforehand.
Finally, it should be trivial to simulate an iteration through the sorted array by iterating through the counts array taking into account the value of the count. If you want stl like iterators you might need a custom data structure that encapsulates that behavior.
Note: in the previous version of this answer I mentioned multiset as a way to use a data structure to count sort. This would be efficient in some java implementations (I believe the Guava implementation would be efficient) but not in C++ where the keys in the RB tree are just repeated many times.
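For plain integers, the counting step and the "iterate through the counts" step can be combined into a complete sort. The following is a minimal sketch that assumes the bound m is known up front, so the counts vector is sized once; counting_sort is just an illustrative name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sorts values known to lie in [0, m): count each value,
// then rewrite the vector in increasing order of value.
void counting_sort(std::vector<int>& to_sort, int m)
{
    assert(m >= 0);
    std::vector<std::size_t> counts(static_cast<std::size_t>(m), 0);
    for (int v : to_sort) {
        assert(v >= 0 && v < m);
        ++counts[static_cast<std::size_t>(v)];
    }
    std::size_t out = 0;
    for (int v = 0; v < m; ++v) {
        // emit value v exactly counts[v] times
        for (std::size_t c = 0; c < counts[static_cast<std::size_t>(v)]; ++c) {
            to_sort[out++] = v;
        }
    }
}
```

This only works when the elements carry no data beyond the key; for objects with other members, the bucket-of-pointers variant above or an in-place permutation is needed instead.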
You say "in-place", I therefore assume that you don't want to use O(n) extra memory.
First, count the number of objects with each value (as in Giovanni's and ronaldo's answers). You still need to get the objects into the right locations in-place. I think the following works, but I haven't implemented or tested it:
Create a cumulative sum from your counts, so that you know what index each object needs to go to. For example, if the counts are 1: 3, 2: 5, 3: 7, then the cumulative sums are 1: 0, 2: 3, 3: 8, 4: 15, meaning that the first object with value 1 in the final array will be at index 0, the first object with value 2 will be at index 3, and so on.
The basic idea now is to go through the vector, starting from the beginning. Get the element's processor index, and look up the corresponding cumulative sum. This is where you want it to be. If it's already in that location, move on to the next element of the vector and increment the cumulative sum (so that the next object with that value goes in the next position along). If it's not already in the right location, swap it with the correct location, increment the cumulative sum, and then continue the process for the element you swapped into this position in the vector.
There's a potential problem when you reach the start of a block of elements that have already been moved into place. You can solve that by remembering the original cumulative sums, "noticing" when you reach one, and jump ahead to the current cumulative sum for that value, so that you don't revisit any elements that you've already swapped into place. There might be a cleverer way to deal with this, but I don't know it.
Finally, compare the performance (and correctness!) of your code against std::sort. This has better time complexity than std::sort, but that doesn't mean it's necessarily faster for your actual data.
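The counting and cumulative-sum steps might be sketched like this. It covers only the bookkeeping, not the in-place swapping; start_offsets is a hypothetical name, and keys are assumed to be 0-based here, so the values 1, 2, 3 in the worked example correspond to keys 0, 1, 2.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// For each key k in [0, num_keys), offsets[k] is the index where the first
// element with key k belongs; offsets[num_keys] is the total element count.
std::vector<std::size_t> start_offsets(const std::vector<int>& keys,
                                       std::size_t num_keys)
{
    std::vector<std::size_t> counts(num_keys, 0);
    for (int k : keys) {
        assert(k >= 0 && static_cast<std::size_t>(k) < num_keys);
        ++counts[static_cast<std::size_t>(k)];
    }
    std::vector<std::size_t> offsets(num_keys + 1, 0);
    // Running totals shifted by one slot give the starting index of each key.
    std::partial_sum(counts.begin(), counts.end(), offsets.begin() + 1);
    return offsets;
}
```

With three elements of the first key, five of the second and seven of the third, this yields 0, 3, 8, 15, matching the worked example.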
You definitely want to use counting sort. But not the one you're thinking of. Its main selling point is that its time complexity is O(N+X) where X is the maximum value you allow the sorting of.
Regular old counting sort (as seen on some other answers) can only sort integers, or has to be implemented with a multiset or some other data structure (becoming O(Nlog(N))). But a more general version of counting sort can be used to sort (in place) anything that can provide an integer key, which is perfectly suited to your use case.
The algorithm is somewhat different though, and it's also known as American Flag Sort. Just like regular counting sort, it starts off by calculating the counts.
After that, it builds a prefix sums array of the counts. This is so that we can know how many elements should be placed behind a particular item, thus allowing us to index into the right place in constant time.
Since we know the correct final position of the items, we can just swap them into place. Doing just that would work if there weren't any repetitions but, since it's almost certain that there will be repetitions, we have to be more careful.
First: when we put something into its place, we have to increment the value in the prefix sum so that the next element with the same value doesn't remove the previous element from its place.
Second: either
a) keep track of how many elements of each value we have already put into place, so that we don't keep moving elements of values that have already reached their place. This requires a second copy of the counts array (prior to calculating the prefix sum), as well as a "move count" array; or
b) keep a copy of the prefix sums shifted over by one, so that we stop moving elements once the stored position of the latest element reaches the first position of the next value.
Even though the first approach is somewhat more intuitive, I chose the second method (because it's faster and uses less memory).
template<class It, class KeyOf>
void countsort (It begin, It end, KeyOf key_of) {
constexpr int max_value = 1000;
int final_destination[max_value] = {}; // zero initialized
int destination[max_value] = {}; // zero initialized
// Record counts
for (It it = begin; it != end; ++it)
final_destination[key_of(*it)]++;
// Build prefix sum of counts
for (int i = 1; i < max_value; ++i) {
final_destination[i] += final_destination[i-1];
destination[i] = final_destination[i-1];
}
for (auto it = begin; it != end; ++it) {
auto key = key_of(*it);
// while item is not in the correct position
while ( std::distance(begin, it) != destination[key] &&
// and not all items of this value have reached their final position
final_destination[key] != destination[key] ) {
// swap into the right place
std::iter_swap(it, begin + destination[key]);
// tidy up for next iteration
++destination[key];
key = key_of(*it);
}
}
}
Usage:
vector<Person> records = populateRecords();
countsort(records.begin(), records.end(), [](Person const & person){
return person.id()-1; // map [1, 1000] -> [0, 1000)
});
This can be further generalized to become MSD Radix Sort,
here's a talk by Malte Skarupke about it: https://www.youtube.com/watch?v=zqs87a_7zxw
Here's a neat visualization of the algorithm: https://www.youtube.com/watch?v=k1XkZ5ANO64
The answer given by Giovanni Botta is perfect, and Counting Sort is definitely the way to go. However, I personally prefer not to go resizing the vector progressively, but I'd rather do it this way (assuming your range is [0-1000]):
vector<int> to_sort;
vector<int> counts(1001);
int maxvalue=0;
for (int i : to_sort) {
if(i > maxvalue) maxvalue = i;
counts[i]++;
}
counts.resize(maxvalue+1);
It is essentially the same, but no need to be constantly managing the size of the counts vector. Depending on your memory constraints, you could use one solution or the other.

How do you detect the element of highest occurrence in a pointer array? [duplicate]

I'm looking for an elegant way of determining which element has the highest occurrence (mode) in a C++ ptr array.
For example, in
{"pear", "apple", "orange", "apple"}
the "apple" element is the most frequent one.
My previous attempts have failed
EDIT: The array has already been sorted.
int getMode(int *students,int size)
{
int mode;
int count=0,
maxCount=0,
preVal;
preVal=students[0]; //preVall holds current mode number being compared
count=1;
for(int i =0; i<size; i++) //Check each number in the array
{
if(students[i]==preVal) //checks if current mode is seen again
{
count++; //The amount of times current mode number has been seen.
if(maxCount<count) //if the amount of times mode has been seen is more than maxcount
{
maxCount=count; //the larger it mode that has been seen is now the maxCount
mode=students[i]; //The current array item will become the mode
}else{
preVal = students[i];
count = 1;
}
}
}
return mode;
}
There are several possible solutions to that problem, but first some advice:
Don't use C-style arrays. Use std::array for fixed (compile-time) size arrays or std::vector for arrays on the heap, including arrays whose size is determined at runtime but does not change after creation (std::dynarray was proposed for that case, but it was not adopted into C++14). Those containers do the memory management for you, and you do not need to pass the array size around separately. In addition to using containers, prefer to use the algorithms in <algorithm> where appropriate. If you don't know the containers and algorithms, take some time to get familiar with them; that time will pay off very soon.
So, here are some solution sketches:
Sort the array, then count the occurrences of consecutive values. It's much easier than keeping track of which values you have already counted and which not. You basically need only two value-count pairs: one for the value you are currently counting, one for the maximum count up to now. You will only need a fifth variable: the iterator for the container.
If you cannot sort your array or need to keep track of all counts, use a map to map values to their number of occurrence in the array. If you are familiar with std::map, that is very simple to do. At the end, search for the maximum count, i.e. for the maximum map value:
for (auto i: students) countMap[i]++;
auto pos = std::max_element(begin(countMap), end(countMap),
[](auto lhs, auto rhs){ return lhs.second < rhs.second; }); //! see below
auto maxCount = pos->second;
Note: this uses C++11's range based for and a C++14 polymorphic Lambda. It should be obvious what is done here, so it can be adjusted for the C++11/C++14 support your compiler provides.
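The first sketch (counting runs of equal values in a sorted array) could look roughly like this. mode_of_sorted is an illustrative name; the function assumes a non-empty, already sorted input and returns the smallest value when several share the highest count.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the most frequent value of a sorted, non-empty vector.
// On ties, the first (smallest) value with the maximum count wins.
int mode_of_sorted(const std::vector<int>& v)
{
    assert(!v.empty());
    int mode = v[0];           // value with the longest run so far
    std::size_t maxCount = 1;  // length of that run
    std::size_t count = 1;     // length of the run currently being counted
    for (std::size_t i = 1; i < v.size(); ++i) {
        count = (v[i] == v[i - 1]) ? count + 1 : 1;
        if (count > maxCount) {
            maxCount = count;
            mode = v[i];
        }
    }
    return mode;
}
```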

Pass vector position in std::for_each

I have a data structure in sparse compressed column format.
For my given algorithm, I need to iterate over all the values in a "column" of data and do a bunch of stuff. Currently, it is working nicely using a regular for loop. The boss wants me to re-code this as a for_each loop for future parallelization.
For those not familiar with sparse compressed column, it uses 2 (or 3) vectors to represent the data. One vector is just a long list of values. The second vector is the index of where each column starts.
The current version
// for processing data in column 5
vector values;
vector colIndex;
vector rowIndex;
int column = 5;
for(int i = colIndex[column]; i != colIndex[column + 1]; i++){
value = values[i];
row = rowIndex[i];
// do stuff
}
The key is that I need to know the location(as an integer) in my values column in order to lookup the row position (And a bunch of other stuff I'm not bothering to list here.)
If I use the std::for_each() function, I get the value at the position, not the position. I need the position itself.
One thought, and clearly not an efficient one, would be to create a vector of integers the same length as my data. That way, I could pass an iterator over that dummy vector to the function in for_each, and the value passed to my function would be the position. However, this seems like the least efficient way.
Any thoughts?
My challenge is that I need to know the position in the vector. for_each takes an iterator and sends the value of that iterator to the function.
Use boost::counting_iterator<int>, or implement your own.
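A hand-rolled version might look like the sketch below. index_iterator is a hypothetical, minimal stand-in for boost::counting_iterator<int>, and the weighted sum in process_column is only a placeholder for the "do stuff" part of the loop; the element types are assumptions as well.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical minimal counting iterator: dereferencing yields the index itself.
struct index_iterator {
    using iterator_category = std::input_iterator_tag;
    using value_type = int;
    using difference_type = std::ptrdiff_t;
    using pointer = const int*;
    using reference = int;

    int i;
    int operator*() const { return i; }
    index_iterator& operator++() { ++i; return *this; }
    index_iterator operator++(int) { index_iterator old = *this; ++i; return old; }
    bool operator==(const index_iterator& o) const { return i == o.i; }
    bool operator!=(const index_iterator& o) const { return i != o.i; }
};

// Placeholder work: sums values[i] * rowIndex[i] over one column.
double process_column(const std::vector<double>& values,
                      const std::vector<int>& rowIndex,
                      const std::vector<int>& colIndex, int column)
{
    assert(column >= 0 &&
           static_cast<std::size_t>(column) + 1 < colIndex.size());
    double total = 0.0;
    std::for_each(index_iterator{colIndex[column]},
                  index_iterator{colIndex[column + 1]},
                  [&](int i) { total += values[i] * rowIndex[i]; });
    return total;
}
```

The lambda receives the position, so both values[i] and rowIndex[i] are available. Note that accumulating into a captured variable like total is fine sequentially but would need a reduction or synchronization once the loop is parallelized.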
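@n.m.'s answer is probably the best, but it is possible with only what the standard library provides, though painfully slow I assume:
void your_loop_func(const T& val){
    // linear search for the value; this finds the first occurrence only,
    // so it gives the right index only if the values are distinct
    auto it = std::find(values.begin(), values.end(), val);
    std::ptrdiff_t index = it - values.begin();
    value = val;
    row = rowIndex[index];
}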
And after writing that, I really can only recommend the Boost counting_iterator version. ;)