Making the number of key occurrences equal using CUDA / Thrust - c++

Is there an efficient way to take a sorted key/value array pair and ensure that each key has an equal number of elements using the CUDA Thrust library?
For instance, assume we have the following pair of arrays:
ID: 1 2 2 3 3 3
VN: 6 7 8 5 7 8
If we want to have two of each key appear, this would be the result:
ID: 2 2 3 3
VN: 7 8 5 7
The actual arrays will be much larger, containing millions of elements or more. I'm able to do this using nested for-loops easily, but I'm interested in knowing whether or not there's a more efficient way to convert the arrays using a GPU. Thrust seems as though it may be useful, but I don't see any obvious functions to use.
Thank you for your help!

Caveat: If this is the only operation you plan to do on the GPU, I would not recommend it. The cost to copy the data to/from the GPU will likely outweigh any possible efficiency/performance benefit from using the GPU.
EDIT: based on the comments that the sequence threshold is likely to be much longer than 2, I'll suggest an alternate method (method 2) that should be more efficient than a for-loop or brute-force method (method 1).
In general I would place this problem in a category called stream compaction. Stream compaction generally refers to taking a sequence of data and reducing it to a smaller sequence of data.
If we look in the thrust stream compaction area, an algorithm that could be made to work for this problem is thrust::copy_if() (in particular, for convenience, the version that takes a stencil array).
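For readers unfamiliar with the stencil form, here is a minimal standalone illustration (not the solution itself; the data and predicate are made up): thrust::copy_if copies data[i] to the output whenever the predicate applied to stencil[i] is true.
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
int main(){
  int d[] = {10, 20, 30, 40};
  int s[] = { 1,  0,  1,  0};
  thrust::device_vector<int> data(d, d+4), stencil(s, s+4), out(4);
  // keep data[i] where stencil[i] is nonzero; identity acts as the predicate
  int n = thrust::copy_if(data.begin(), data.end(), stencil.begin(), out.begin(), thrust::identity<int>()) - out.begin();
  // out now holds {10, 30} and n == 2
  return 0;
}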
method 1:
To think about this problem in parallel, we must ask ourselves under what condition should a given element be copied from the input to the output? If we can formalize this logic, we can construct a thrust functor which we can pass to thrust::copy_if to instruct it as to which elements to copy.
For a given element, for the sequence length = 2 case, we can construct a complete logic if we know:
the element
the element one place to the right
the element one place to the left
the element two places to the left
Based on the above, we will need to come up with "special case" logic for those elements for which any of the items 2,3, or 4 above are undefined.
Ignoring the special cases, if we know the above 4 items, then we can construct the necessary logic as follows:
If the element to my left is the same as me, but the element two places to the left is different, then I belong in the output
If the element to my left is different than me, but the element to my right is the same as me, I belong in the output
Otherwise, I don't belong in the output
I'll leave it to you to construct the necessary logic for the special cases. (Or reverse-engineer it from the code I've provided).
method 2:
For long sequences, method 1 or a for-loop variant of the logic in method 1 will generate at least 1 read of the data set per element of the sequence length. For a long sequence (e.g. 2000) this will be inefficient. Therefore another possible approach would be as follows:
Generate an exclusive_scan_by_key in both forward and reverse directions, using the ID values as the key, and a thrust::constant_iterator (value=1) as the values for the scan. For the given data set, that creates intermediate results like this:
ID: 1 2 2 3 3 3
VN: 6 7 8 5 7 8
FS: 0 0 1 0 1 2
RS: 0 1 0 2 1 0
where FS and RS are the results of the forward and reverse scan-by-key. We generate the reverse scan (RS) using .rbegin() and .rend() reverse iterators. Note that this has to be done both for the reverse scan input and output, in order to generate the RS sequence as above.
The logic for our thrust::copy_if functor then becomes fairly simple. For a given element, if the sum of the RS and FS value for that element is greater than or equal to the desired minimum sequence length (-1 to account for exclusive scan operation) and the FS value is less than the desired minimum sequence length, then that element belongs in the output.
Here's a fully worked example of both methods, using the given data, for sequence length 2:
$ cat t1095.cu
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <iostream>
#include <iterator>
#include <thrust/scan.h>
#include <thrust/iterator/constant_iterator.h>

struct copy_func
{
  int *d;
  int dsize, r, l, m, l2;
  copy_func(int *_d, int _dsize) : d(_d), dsize(_dsize) {};
  __host__ __device__
  bool operator()(int idx)
  {
    m = d[idx];
    // handle typical case
    // this logic could be replaced by a for-loop for sequences of arbitrary length
    if ((idx > 1) && (idx < dsize-1)){
      r = d[idx+1];
      l = d[idx-1];
      l2 = d[idx-2];
      if ((r == m) && (m != l)) return true;
      if ((l == m) && (m != l2)) return true;
      return false;}
    // handle special cases
    if (idx == 0){
      r = d[idx+1];
      return (r == m);}
    if (idx == 1){
      r = d[idx+1];
      l = d[idx-1];
      if (l == m) return true;
      else if (r == m) return true;
      return false;}
    if (idx == dsize-1){
      l = d[idx-1];
      l2 = d[idx-2];
      if ((m == l) && (m != l2)) return true;
      return false;}
    // could put assert(0) here, should never get here
    return false;
  }
};

struct copy_func2
{
  int thresh;
  copy_func2(int _thresh) : thresh(_thresh) {};
  template <typename T>
  __host__ __device__
  bool operator()(T t){
    return (((thrust::get<0>(t) + thrust::get<1>(t)) >= (thresh-1)) && (thrust::get<0>(t) < thresh));
  }
};

int main(){
  const int length_threshold = 2;
  int ID[] = {1,2,2,3,3,3};
  int VN[] = {6,7,8,5,7,8};
  int dsize = sizeof(ID)/sizeof(int);
  // we assume dsize > 3
  thrust::device_vector<int> id(ID, ID+dsize);
  thrust::device_vector<int> vn(VN, VN+dsize);
  thrust::device_vector<int> res_id(dsize);
  thrust::device_vector<int> res_vn(dsize);
  thrust::counting_iterator<int> idx(0);
  // method 1: sequence length threshold of 2
  int rsize = thrust::copy_if(thrust::make_zip_iterator(thrust::make_tuple(id.begin(), vn.begin())), thrust::make_zip_iterator(thrust::make_tuple(id.end(), vn.end())), idx, thrust::make_zip_iterator(thrust::make_tuple(res_id.begin(), res_vn.begin())), copy_func(thrust::raw_pointer_cast(id.data()), dsize)) - thrust::make_zip_iterator(thrust::make_tuple(res_id.begin(), res_vn.begin()));
  std::cout << "ID: ";
  thrust::copy_n(res_id.begin(), rsize, std::ostream_iterator<int>(std::cout, " "));
  std::cout << std::endl << "VN: ";
  thrust::copy_n(res_vn.begin(), rsize, std::ostream_iterator<int>(std::cout, " "));
  std::cout << std::endl;
  // method 2: for arbitrary sequence length threshold
  thrust::device_vector<int> res_fs(dsize);
  thrust::device_vector<int> res_rs(dsize);
  thrust::exclusive_scan_by_key(id.begin(), id.end(), thrust::constant_iterator<int>(1), res_fs.begin());
  thrust::exclusive_scan_by_key(id.rbegin(), id.rend(), thrust::constant_iterator<int>(1), res_rs.begin());
  rsize = thrust::copy_if(thrust::make_zip_iterator(thrust::make_tuple(id.begin(), vn.begin())), thrust::make_zip_iterator(thrust::make_tuple(id.end(), vn.end())), thrust::make_zip_iterator(thrust::make_tuple(res_fs.begin(), res_rs.rbegin())), thrust::make_zip_iterator(thrust::make_tuple(res_id.begin(), res_vn.begin())), copy_func2(length_threshold)) - thrust::make_zip_iterator(thrust::make_tuple(res_id.begin(), res_vn.begin()));
  std::cout << "ID: ";
  thrust::copy_n(res_id.begin(), rsize, std::ostream_iterator<int>(std::cout, " "));
  std::cout << std::endl << "VN: ";
  thrust::copy_n(res_vn.begin(), rsize, std::ostream_iterator<int>(std::cout, " "));
  std::cout << std::endl;
  return 0;
}
$ nvcc -o t1095 t1095.cu
$ ./t1095
ID: 2 2 3 3
VN: 7 8 5 7
ID: 2 2 3 3
VN: 7 8 5 7
Notes:
the copy_func implements the test logic for a given element for method 1. It receives the index of that element (via the stencil) as well as a pointer to the ID data on the device, and the size of the data, via functor initialization parameters. The variables r, m, l, and l2 refer to the element to my right, myself, the element to my left, and the element two places to my left, respectively.
we are passing a pointer to the ID data to the functor. This allows the functor to retrieve the (up to) 4 necessary elements for the test logic. This avoids a messy construction of a thrust::zip_iterator to provide all these values. Note that the reads of these elements in the functor should coalesce nicely, and therefore be fairly efficient, and also benefit from the cache.
I don't claim that this is defect-free. I think I got the test logic right, but it's possible I didn't. You should verify the logical correctness of that portion of the code, at least. My purpose is not to give you a black-box piece of code, but to demonstrate how to think your way through the problem.
This approach may get cumbersome for key sequences longer than 2. In that case I would suggest method 2. (If you already have a sequential for-loop that implements the necessary logic, you may be able to drop a modified version of that into the method 1 functor for longer key sequences. Such a for-loop should probably still benefit from coalesced access and adjacent accesses from the cache.)
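For reference, here is a minimal, untested sketch (not part of the original answer) of such a generalized method 1 functor with a for-loop for an arbitrary threshold; copy_func_n and thresh are made-up names:
struct copy_func_n
{
  int *d;
  int dsize, thresh;
  copy_func_n(int *_d, int _dsize, int _thresh) : d(_d), dsize(_dsize), thresh(_thresh) {};
  __host__ __device__
  bool operator()(int idx)
  {
    int key = d[idx];
    // count matching keys immediately to the left, stopping at thresh
    int left = 0;
    for (int i = idx-1; (i >= 0) && (d[i] == key) && (left < thresh); --i) ++left;
    if (left >= thresh) return false; // thresh copies already precede this one
    // count matching keys to the right until the run is provably long enough
    int right = 0;
    for (int i = idx+1; (i < dsize) && (d[i] == key) && (left+right+1 < thresh); ++i) ++right;
    // keep this element if it is among the first thresh of a run of length >= thresh
    return (left + right + 1 >= thresh);
  }
};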

Related

Using sort function to sort vector of tuples in a chained manner

So I tried sorting my list of tuples in a manner that the next tuple's first element equals the second element of the present tuple (the first tuple being the one with the smallest first element).
(x can be anything)
unsorted
3 5 x
4 6 x
1 3 x
2 4 x
5 2 x
sorted
1 3 x
3 5 x
5 2 x
2 4 x
4 6 x
I used the following function as my third argument in the custom sort function
bool myCompare(tuple<int,int,int> a, tuple<int,int,int> b){
    if(get<1>(a) == get<0>(b)){
        return true;
    }
    return false;
}
But my output was unchanged. Please help me fix the function or suggest another way.
This can't be achieved by using std::sort with a custom comparison function: your comparison function doesn't establish a strict weak order on your elements.
The std::sort documentation states that the comparison function has to fulfill the Compare requirements, which say the function has to introduce a strict weak ordering.
See https://en.wikipedia.org/wiki/Weak_ordering for the properties of a strict weak order
Compare requirements: https://en.cppreference.com/w/cpp/named_req/Compare
The comparison function has to return true if the first argument is before the second argument with respect to the strict weak order.
For example the tuple a=(4, 4, x) violates the irreflexivity property comp(a, a) == false
Or a=(4, 6, x) and b=(6, 4, y) violate the asymmetry property that if comp(a, b) == true it is not the case that comp(b, a) == true
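For contrast, a comparator that does satisfy the Compare requirements is plain lexicographic comparison (note this sorts by first element, then second, then third; it does not produce the chained order the question asks for):
bool lexCompare(const tuple<int,int,int>& a, const tuple<int,int,int>& b){
    return a < b; // std::tuple's operator< is a strict weak order
}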
I am not sure where the real problem is coming from.
But the background is the Cyclic Permutation Problem.
In your special case you are looking for a k-cycle where k is equal to the count of tuples. I drafted a solution for you that will show all cycles (not only the desired k-cycle).
And I use the notation described in the provided link. The other values of the tuple are irrelevant for the problem.
But how to implement?
The secret is to select the correct container types. I use two. For a cycle, I use a std::unordered_set. This can contain only unique elements, which prevents an infinite cycle: for example, 0,1,3,0,1,3,0,1,3 . . . is not possible, because each digit can occur only once in the container. That bounds our walk through the permutations: as soon as we see a number that is already in a cycle, we stop.
All found cycles will be stored in the second container type: a std::set. The std::set can also contain only unique values, and the values are ordered. Because we store complex data in the std::set, we create a custom comparator for it. We need to take care that the std::set does not contain duplicate entries, and in our case 0,1,3 and 1,3,0 would be duplicates. In our custom comparator, we therefore first copy the 2 sets into std::vectors and sort the std::vectors. This turns 1,3,0 into 0,1,3, so we can easily detect duplicates.
Please note:
I always store only values from the first permutation in the cycle. The 2nd is used as a helper to find the index of the next value to evaluate.
Please see the code below. It produces 4 non-trivial cycles, and one has the number of elements as expected: 1,3,5,2,4.
Program output:
Found Cycles:
(1,3,5,2,4)(3,5,2,4)(2,4)(5,2,4)
Please digest.
#include <iostream>
#include <vector>
#include <algorithm>
#include <unordered_set>
#include <iterator>
#include <set>
#include <string>
// Make reading easier and define some alias names
using MyType = int;
using Cycle = std::unordered_set<MyType>;
using Permutation = std::vector<MyType>;
using Permutations = std::vector<Permutation>;
// We do not want to have duplicate results.
// A duplicate cycle is also a Cycle with the same elements in a different order.
// So define a custom comparator functor for our resulting set
struct Comparator {
    bool operator () (const Cycle& lhs, const Cycle& rhs) const {
        // Convert the unordered_sets to vectors
        std::vector<MyType> v1(lhs.begin(), lhs.end());
        std::vector<MyType> v2(rhs.begin(), rhs.end());
        // Sort them
        std::sort(v1.begin(), v1.end());
        std::sort(v2.begin(), v2.end());
        // Compare them
        return v1 < v2;
    }
};
// Resulting cycles
using Cycles = std::set<Cycle, Comparator>;
int main() {
    // The source data
    Permutations perms2 = {
        {3,4,1,2,5},
        {5,6,3,4,2} };
    // Lambda to find the index of a given number in the first permutation
    auto findPos = [&perms2](const MyType& m) {return std::distance(perms2[0].begin(), std::find(perms2[0].begin(), perms2[0].end(), m)); };
    // Here we will store our resulting set of cycles
    Cycles resultingCycles{};
    // Go through all single elements of the first permutation
    for (size_t currentColumn = 0U; currentColumn < perms2[0].size(); ++currentColumn) {
        // This is a temporary for a cycle that we found in this loop
        Cycle trialCycle{};
        // First value to start with
        size_t startColumn = currentColumn;
        // Follow the complete path through the 2 permutations
        for (bool insertResult{ true }; insertResult; ) {
            // Insert found element from the first permutation in the current cycle
            const auto& [newElement, insertOk] = trialCycle.insert(perms2[0][startColumn]);
            // Find the index of the element under the first value (from the 2nd permutation)
            startColumn = findPos(perms2[1][startColumn]);
            // Check if we should continue (could we insert a further element in our current cycle?)
            insertResult = insertOk && startColumn < perms2[0].size();
        }
        // We will only consider cycles with a length > 1
        if (trialCycle.size() > 1) {
            // Store the current temporary cycle as an additional result.
            resultingCycles.insert(trialCycle);
        }
    }
    // Simple output
    std::cout << "\n\nFound Cycles:\n\n";
    // Go through all found cycles
    for (const Cycle& c : resultingCycles) {
        // Print an opening brace
        std::cout << "(";
        // Handle the comma delimiter
        std::string delimiter{};
        // Print all integer values of the cycle
        for (const MyType& m : c) {
            std::cout << delimiter << m;
            delimiter = ",";
        }
        std::cout << ")";
    }
    std::cout << "\n\n";
    return 0;
}

ALL solutions to Magic square using no array

Yes, this is for a homework assignment. However, I do not expect an answer.
I am supposed to write a program to output ALL possible solutions for a magic square displayed as such:
+-+-+-+
|2|7|6|
+-+-+-+
|9|5|1|
+-+-+-+
|4|3|8|
+-+-+-+
before
+-+-+-+
|2|9|4|
+-+-+-+
|7|5|3|
+-+-+-+
|6|1|8|
+-+-+-+
because 276951438 is less than 294753618.
I can use for loops (not nested) and if/else. The solutions must be output in ascending order.
Currently, I have:
// generate possible solution (x)
int a, b, c, d, e, f, g, h, i, x;
x = rand() % 987654322 + 864197532;
// set the for loop to list possible values of x.
// This part needs revision
for (x = 123456788; ((x < 987654322) && (sol == true)); ++x)
{
    // split into integers to evaluate
    a = x / 100000000;
    b = x % 100000000 / 10000000;
    c = x % 10000000 / 1000000;
    d = x % 1000000 / 100000;
    e = x % 100000 / 10000;
    f = x % 10000 / 1000;
    g = x % 1000 / 100;
    h = x % 100 / 10;
    i = x % 10;
    // Could this be condensed somehow?
    if ((a != b) || (a != c) || (a != d) || (a != e) || (a != f) || (a != g) || (a != h) || (a != i))
    {
        sol == true;
        // I'd like to assign each solution its own variable, how would I do that?
        std::cout << x;
    }
}
How would I output in ascending order?
I have previously written a program that puts a user-entered nine digit number in the specified table and verifies whether it meets the conditions (n is a magic square solution if the sum of each row = 15, the sum of each column = 15, and the sum of each diagonal = 15), so I can handle that part. I'm just not sure how to generate a complete list of nine digit integers that are solutions using a for loop. Could someone give me an idea of how I would do that and how I could improve my current work?
This question caught my attention, as I answered SO: magic square wrong placement of some numbers a short time ago.
// I'd like to assign each solution its own variable, how would I do that?
I wouldn't consider this. Each found solution can be printed immediately (instead of stored). The upwards-counting loop guarantees that the output is in order.
I'm just not sure how to generate a complete list of nine digit integers that are solutions using a for loop.
The answer is Permutation.
In the case of the OP, this is a set of 9 distinct elements for which all sequences with distinct orders of these elements are desired.
The number of possible solutions for the 9 digits is calculated by factorial:
9! = 9 · 8 · 7 · 6 · 5 · 4 · 3 · 2 · 1 = 362880
Literally, if all possible orders of the 9 digits are to be checked, the loop has to do 362880 iterations.
Googling for a ready algorithm (or at least some inspiration), I found out (to my surprise) that the C++ std Algorithms library is actually well prepared for this:
std::next_permutation()
Transforms the range [first, last) into the next permutation from the set of all permutations that are lexicographically ordered with respect to operator< or comp. Returns true if such permutation exists, otherwise transforms the range into the first permutation (as if by std::sort(first, last)) and returns false.
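For reference, this is how std::next_permutation is typically driven when containers are allowed (which the assignment forbids, hence the workaround below):
#include <algorithm>
#include <iostream>
#include <vector>
int main()
{
    std::vector<int> v{1, 2, 3}; // must start sorted to visit all permutations
    do {
        std::cout << v[0] << ' ' << v[1] << ' ' << v[2] << '\n';
    } while (std::next_permutation(v.begin(), v.end()));
    return 0;
}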
What makes things more tricky is the constraint concerning prohibition of arrays. Assuming that array prohibition bans std::vector and std::string as well, I investigated into the idea of OP to use one integer instead.
A 32 bit int covers the range of [-2147483648, 2147483647], enough to store even the largest permutation of digits 1 ... 9: 987654321. (Maybe std::int32_t would be the better choice.)
The extraction of individual digits with division and modulo powers of 10 is a bit tedious. Storing the set instead as a number with base 16 simplifies things much: the isolation of individual elements (aka digits) becomes a combination of bitwise operations (&, |, ~, <<, and >>). The drawback is that 32 bits are no longer sufficient for nine digits – I used std::uint64_t.
I encapsulated things in a class Set16. I considered providing a reference type and bidirectional iterators. After fiddling a while, I came to the conclusion that it's not that easy (if not impossible). Re-implementing std::next_permutation() according to the sample code provided on cppreference.com was the easier choice.
362880 lines of output are a little bit much for a demonstration. Hence, my sample does it for the smaller set of 3 digits, which has 3! (= 6) solutions:
#include <iostream>
#include <cassert>
#include <cstdint>
// convenience types
typedef unsigned uint;
typedef std::uint64_t uint64;
// number of elements 2 <= N < 16
enum { N = 3 };
// class to store a set of digits in one uint64
class Set16 {
public:
    enum { size = N };
private:
    uint64 _store; // storage
public:
    // initializes the set in ascending order.
    // (This is a premise to start permutation at first result.)
    Set16(): _store()
    {
        for (uint i = 0; i < N; ++i) elem(i, i + 1);
    }
    // get element with a certain index.
    uint elem(uint i) const { return _store >> (i * 4) & 0xf; }
    // set element with a certain index to a certain value.
    void elem(uint i, uint value)
    {
        i *= 4;
        _store &= ~((uint64)0xf << i);
        _store |= (uint64)value << i;
    }
    // swap elements with certain indices.
    void swap(uint i1, uint i2)
    {
        uint temp = elem(i1);
        elem(i1, elem(i2));
        elem(i2, temp);
    }
    // reverse order of elements in range [i1, i2)
    void reverse(uint i1, uint i2)
    {
        while (i1 < i2) swap(i1++, --i2);
    }
};

// re-orders set to provide next permutation of set.
// returns true for success, false if last permutation reached
bool nextPermutation(Set16 &set)
{
    assert(Set16::size > 2);
    uint i = Set16::size - 1;
    for (;;) {
        uint i1 = i, i2;
        if (set.elem(--i) < set.elem(i1)) {
            i2 = Set16::size;
            while (set.elem(i) >= set.elem(--i2));
            set.swap(i, i2);
            set.reverse(i1, Set16::size);
            return true;
        }
        if (!i) {
            set.reverse(0, Set16::size);
            return false;
        }
    }
}

// pretty-printing of Set16
std::ostream& operator<<(std::ostream &out, const Set16 &set)
{
    const char *sep = "";
    for (uint i = 0; i < Set16::size; ++i, sep = ", ") out << sep << set.elem(i);
    return out;
}

// main
int main()
{
    Set16 set;
    // output all permutations of sample
    unsigned n = 0; // permutation counter
    do {
#if 1 // for demo:
        std::cout << set << std::endl;
#else // the OP wants instead:
        /* #todo check whether sample builds a magic square
         * something like this:
         * if (
         *     // first row
         *     set.elem(0) + set.elem(1) + set.elem(2) == 15
         * etc.
         */
#endif // 1
        ++n;
    } while(nextPermutation(set));
    std::cout << n << " permutations found." << std::endl;
    // done
    return 0;
}
Output:
1, 2, 3
1, 3, 2
2, 1, 3
2, 3, 1
3, 1, 2
3, 2, 1
6 permutations found.
Live demo on ideone
So, here I am: permutations without arrays.
Finally, another idea hit me. Maybe the intention of the assignment was rather meant to teach "the look from outside"... It could be worth studying the description of Magic Squares again:
Equivalent magic squares
Any magic square can be rotated and reflected to produce 8 trivially distinct squares. In magic square theory, all of these are generally deemed equivalent and the eight such squares are said to make up a single equivalence class.
Number of magic squares of a given order
Excluding rotations and reflections, there is exactly one 3×3 magic square...
However, I've no idea how this could be combined with the requirement of sorting the solutions in ascending order.

Counting numbers a AND s = a

I am writing a program to meet the following specifications:
You have a list of integers, initially the list is empty.
You have to process Q operations of three kinds:
add s: Add integer s to your list, note that an integer can exist
more than one time in the list
del s: Delete one copy of integer s from the list, it's guaranteed
that at least one copy of s will exist in the list.
cnt s: Count how many integers a are there in the list such that a
AND s = a , where AND is bitwise AND operator
Additional constraints:
1 ≤ Q ≤ 200000
0 ≤ s < 2 ^ 16
I have two approaches but both time out, as the constraints are quite large.
I used the fact that a AND s = a if and only if s has all the set bits of a, and the other bits can be arbitrarily assigned. So we can iterate over all these numbers and increase their count by one.
For example, if we have the number 10: 1010
Then the numbers 1011,1111,1110 will be such that when anded with 1010, they will give 1010. So we increase the count of 10,11,14 and 15 by 1. And for delete we delete one from their respective counts.
Is there a faster method? Should I use a different data structure?
Let's consider two ways to solve it that are too slow, and then merge them into one solution that is guaranteed to finish in milliseconds.
Approach 1 (slow)
Allocate an array v of size 2^16. Every time you add an element, do the following:
void add(int s) {
    for (int i = 0; i < (1 << 16); ++ i) if ((s & i) == 0) {
        v[s | i] ++;
    }
}
(to delete do the same, but decrement instead of incrementing)
Then to answer cnt s you just need to return the value of v[s]. To see why, note that v[s] is incremented exactly once for every number a that is added such that a & s == a (I will leave it as an exercise to figure out why this is the case).
Approach 2 (slow)
Allocate an array v of size 2^16. When you add an element s, just increment v[s]. To query the count, do the following:
int cnt(int s) {
    int ret = 0;
    for (int i = 0; i < (1 << 16); ++ i) if ((s | i) == s) {
        ret += v[s & ~i];
    }
    return ret;
}
(x & ~y is a number that has all the bits that are set in x that are not set in y)
This is a more straightforward approach, and is very similar to what you do, but is written in a slightly different fashion. You will see why I wrote it this way when we combine the two approaches.
Both these approaches are too slow, because in each of them one operation is constant time and the other is O(s), so in the worst case, when the entire input consists of the slow operations, we spend O(Q * s), which is prohibitively slow. Now let's merge the two approaches using meet-in-the-middle to get a faster solution.
Fast approach
We will merge the two approaches in the following way: add will work similarly to the first approach, but instead of considering every number a such that a & s == a, we will only consider numbers that differ from s only in the lowest 8 bits:
void add(int s) {
    for (int i = 0; i < (1 << 8); ++ i) if ((i & s) == 0) {
        v[s | i] ++;
    }
}
For delete do the same, but instead of incrementing elements, decrement them.
For counts we will do something similar to the second approach, but we will account for the fact that each v[a] is already accumulated for all combinations of the lowest 8 bits, so we only need to iterate over all the combinations of the higher 8 bits:
int cnt(int s) {
    int ret = 0;
    for (int i = 0; i < (1 << 8); ++ i) if ((s | (i << 8)) == s) {
        ret += v[s & ~(i << 8)];
    }
    return ret;
}
Now both add and cnt work in O(sqrt(s)), so the entire approach is O(Q * sqrt(s)), which for your constraints should be milliseconds.
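For completeness, a sketch of the del operation described above, together with the counter array the snippets assume (the declaration is an assumption, not given in the original):
int v[1 << 16]; // one accumulated counter per 16-bit value; global, so zero-initialized
void del(int s) {
    for (int i = 0; i < (1 << 8); ++ i) if ((i & s) == 0) {
        v[s | i] --; // exact mirror of add()
    }
}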
Pay extra attention to overflows -- you didn't provide the upper bound on s, if it is too high, you might want to replace ints with long longs.
One of the ways to solve it is to break the list of queries into blocks of about sqrt(S) queries each. This is a standard approach, usually called sqrt-decomposition.
You have to store separately:
Array A[v]: how many times v is present in the list.
Array R[v]: sum of A[i] for all i that are supersets of v (i.e. the result of cnt(v)).
List W of all changes (add, del operations) within current block of queries.
Note: arrays A and R reflect only the changes from fully processed blocks of queries. All the changes that happened within the currently processed block of queries are stored in W and are not yet applied to A and R.
Now we process queries block by block, for each block of queries we do:
For each query within block:
add(v): store increment for v into W list.
del(v): store decrement for v into W list.
cnt(v): return R[v] + X(W), where X(W) is the total change calculated by trivially processing all the changes in the list W.
Apply all the changes from W to array A, clear list W.
Recalculate completely array R from array A.
Note that add and del take O(1) time, and cnt takes O(|W|) = O(sqrt(S)) time. So step 1 takes O(Q sqrt(S)) time in total.
Step 2 takes O(|W|) time, which totals in O(Q) time overall.
The most important part is step 3. We need to implement it in O(S). Given that there are Q / sqrt(S) blocks, this would total in O(Q sqrt(S)) time as wanted.
Unfortunately, recalculating array R can be done in only O(S log S) time. That would mean O(Q sqrt(S) log(S)) time overall. If we choose block size O(sqrt(S log S)), then the overall time is O(Q sqrt(S log S)). Not perfect, but interesting nonetheless =)
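Step 3 can be done with the standard "sum over supersets" dynamic programming; here is a minimal sketch, assuming S = 2^16 and plain int arrays named A and R as above:
void recalc_R(const int* A, int* R) {
    const int S = 1 << 16;
    for (int m = 0; m < S; ++m) R[m] = A[m];
    for (int b = 0; b < 16; ++b)         // one pass per bit: log S passes
        for (int m = 0; m < S; ++m)
            if (!(m & (1 << b)))         // if m lacks bit b,
                R[m] += R[m | (1 << b)]; // supersets that add bit b contribute
}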
Given the data structure that you described in one of the comments, you could try the following algorithm (I am giving it in pseudo-code), which enumerates exactly the submasks of s, i.e. every i with (i AND s) == i, via the standard (i - 1) AND s trick:
count-how-many-integers(integer s) {
    sum = 0
    i = s
    loop {
        sum = sum + a[i]      // a[i]: how many copies of i are in the list
        if i == 0 { break }   // 0 is always the last submask
        i = (i - 1) AND s     // step to the next smaller submask of s
    }
    return sum
}
This visits each submask exactly once, so no membership test is needed in the loop; for dense values of s it may still touch up to 2^popcount(s) entries.

Is a 2D array of C++ vectors suitable for keeping track of a 2D array's dynamic domain values?

I am writing a C++ backtracking program with a CSP algorithm to solve a Sudoku puzzle.
Variables are mapped to a 9X9 grid (81 variables), so the program is row/column oriented.
To make backtracking smarter, the program needs to keep track of the possible values that each variable on the 9X9 grid can still accept.
(The list of numbers is 1 - 9 for each of the 81 variables and is constantly changing.)
My initial thought is to use a 2D array of vectors - to map to each variable.
For example vector[1][5] will contain all the possible values for variable[1][5].
In terms of efficiency and ease of use - is this the right container or is there something else that works better?
Using an std::vector for this sounds unnecessary and overkill. Since you know the exact domain of your variables, and it's only the numbers 1-9, I suggest using a two dimensional array where each position works as a bitmap.
Code sample (untested):
short vec[9][9] = { 0 };
/* v must be in the range [1-9] */
void remove_value(int x, int y, int v) {
    vec[x][y] |= 1 << v;
}
int test_value(int x, int y, int v) {
    return (vec[x][y] & (1 << v));
}
int next_value(int x, int y) {
    int res = 1;
    for (int mask = 2;
         mask != (1 << 10) && (vec[x][y] & mask);
         mask <<= 1, res++)
        ; /* Intentionally left blank */
    return res;
}
Think of vec[x][y] as a binary integer initialized to 0:
...0000000000
The meaning is such that a bit i set to 1 means you have already tested number i, otherwise, you haven't tested it yet. Bit counting, as usual, is right to left, and starts from 0. You will only be using bits 1 to 9.
remove_value() should be called every time you have finished testing a new value (that is, to remove this value from the domain), and test_value() can be used to check whether v has ever been tested - it will return 0 if v has not been used yet, and something nonzero otherwise (to be precise, 1 << v). next_value() will give you the next value to test for a position [x,y] in ascending order, or 10 if every value in the range 1-9 has already been tested.
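A minimal usage sketch of these helpers (assuming the snippet above; just to illustrate the calling pattern):
#include <iostream>
int main() {
    remove_value(1, 5, 7);                // value 7 has been tested for cell (1,5)
    std::cout << test_value(1, 5, 7)      // nonzero: 7 already tested
              << ' ' << next_value(1, 5)  // 1: the smallest untested value
              << '\n';
    return 0;
}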

Majority element - parts of an array

I have an array, filled with integers. My job is to find the majority element quickly for any part of the array, and I need to do it in O(log n) time, not linear; beforehand I can take some time to prepare the array.
For example:
1 5 2 7 7 7 8 4 6
And queries:
[4, 7] returns 7
[4, 8] returns 7
[1, 2] returns 0 (no majority element), and so on...
I need to have an answer for each query, and if possible it needs to execute fast.
For preparation, I can use O(n log n) time.
O(log n) queries and O(n log n) preprocessing/space could be achieved by finding and using majority intervals with following properties:
For each value from input array there may be one or several majority intervals (or there may be none if elements with these values are too sparse; we don't need majority intervals of length 1 because they may be useful only for query intervals of size 1 which are better handled as a special case).
If query interval lies completely inside one of these majority intervals, corresponding value may be the majority element of this query interval.
If there is no majority interval completely containing query interval, corresponding value cannot be the majority element of this query interval.
Each element of input array is covered by O(log n) majority intervals.
In other words, the only purpose of majority intervals is to provide O(log n) majority element candidates for any query interval.
This algorithm uses following data structures:
List of positions for each value from input array (map<Value, vector<Position>>). Alternatively unordered_map may be used here to improve performance (but we'll need to extract all keys and sort them so that structure #3 is filled in proper order).
List of majority intervals for each value (vector<Interval>).
Data structure for handling queries (vector<small_map<Value, Data>>), where Data contains two indexes into the appropriate vector from structure #1, pointing to the next/previous positions of elements with the given value. Update: thanks to @justhalf, it is better to store in Data the cumulative frequency of elements with the given value. small_map may be implemented as a sorted vector of pairs - preprocessing will append elements already in sorted order and queries will use small_map only for linear search.
Preprocessing:
Scan input array and push current position to appropriate vector in structure #1.
Perform steps 3 .. 4 for every vector in structure #1.
Transform list of positions into list of majority intervals. See details below.
For each index of input array covered by one of majority intervals, insert data to appropriate element of structure #3: value and positions of previous/next elements with this value (or cumulative frequency of this value).
Query:
If query interval length is 1, return corresponding element of source array.
For starting point of query interval get corresponding element of 3rd structure's vector. For each element of the map perform step 3. Scan all elements of the map corresponding to ending point of query interval in parallel with this map to allow O(1) complexity for step 3 (instead of O(log log n)).
If the map corresponding to ending point of query interval contains matching value, compute s3[stop][value].prev - s3[start][value].next + 1. If it is greater than half of the query interval, return value. If cumulative frequencies are used instead of next/previous indexes, compute s3[stop+1][value].freq - s3[start][value].freq instead.
If nothing found on step 3, return "Nothing".
Main part of the algorithm is getting majority intervals from list of positions:
Assign a weight to each position in the list: number_of_matching_values_to_the_left (counting the position itself as matching) - number_of_nonmatching_values_to_the_left.
Filter only weights in strictly decreasing order (greedily) into the "prefix" array: for (auto w: weights) if (prefix.empty() || w < prefix.back()) prefix.push_back(w);.
Filter only weights in strictly increasing order (greedily, backwards) into the "suffix" array: reverse(weights); for (auto w: weights) if (suffix.empty() || w > suffix.back()) suffix.push_back(w);.
Scan "prefix" and "suffix" arrays together and find intervals from every "prefix" element to corresponding place in "suffix" array and from every "suffix" element to corresponding place in "prefix" array. (If all "suffix" elements' weights are less than given "prefix" element or their position is not to the right of it, no interval generated; if there is no "suffix" element with exactly the weight of given "prefix" element, get nearest "suffix" element with larger weight and extend interval with this weight difference to the right).
Merge overlapping intervals.
Properties 1 .. 3 for majority intervals are guaranteed by this algorithm. As for property #4, the only way I could imagine to cover some element with maximum number of majority intervals is like this: 11111111222233455666677777777. Here element 4 is covered by 2 * log n intervals, so this property seems to be satisfied. See more formal proof of this property at the end of this post.
Example:
For input array "0 1 2 0 0 1 1 0" the following lists of positions would be generated:
value positions
0 0 3 4 7
1 1 5 6
2 2
Positions for value 0 will get the following properties:
weights: 0:1 3:0 4:1 7:0
prefix: 0:1 3:0 (strictly decreasing)
suffix: 4:1 7:0 (strictly increasing when scanning backwards)
intervals: 0->4 3->7 4->0 7->3
merged intervals: 0-7
Positions for value 1 will get the following properties:
weights: 1:0 5:-2 6:-1
prefix: 1:0 5:-2
suffix: 1:0 6:-1
intervals: 1->none 5->6+1 6->5-1 1->none
merged intervals: 4-7
Query data structure:
positions  value  next  prev
0          0      0     x
1..2       0      1     0
3          0      1     1
4          0      2     2
4          1      1     x
5          0      3     2
...
Query [0,4]:
prev[4][0]-next[0][0]+1=2-0+1=3
query size=5
3>2.5, returned result 0
Query [2,5]:
prev[5][0]-next[2][0]+1=2-1+1=2
query size=4
2=2, returned result "none"
Note that there is no attempt to inspect element "1" because its majority interval does not include either of these intervals.
Proof of property #4:
Majority intervals are constructed in such a way that strictly more than 1/3 of all their elements have corresponding value. This ratio is nearest to 1/3 for sub-arrays like any*(m-1) value*m any*m, for example, 01234444456789.
To make this proof more obvious, we could represent each interval as a point in 2D: every possible starting point represented by horizontal axis and every possible ending point represented by vertical axis (see diagram below).
All valid intervals are located on or above diagonal. White rectangle represents all intervals covering some array element (represented as unit-size interval on its lower right corner).
Let's cover this white rectangle with squares of size 1, 2, 4, 8, 16, ... sharing the same lower right corner. This divides white area into O(log n) areas similar to yellow one (and single square of size 1 containing single interval of size 1 which is ignored by this algorithm).
Let's count how many majority intervals may be placed into yellow area. One interval (located at the nearest to diagonal corner) occupies 1/4 of elements belonging to interval at the farthest from diagonal corner (and this largest interval contains all elements belonging to any interval in yellow area). This means that smallest interval contains strictly more than 1/12 values available for whole yellow area. So if we try to place 12 intervals to yellow area, we have not enough elements for different values. So yellow area cannot contain more than 11 majority intervals. And white rectangle cannot contain more than 11 * log n majority intervals. Proof completed.
11 * log n is overestimation. As I said earlier, it's hard to imagine more than 2 * log n majority intervals covering some element. And even this value is much greater than average number of covering majority intervals.
C++11 implementation. See it either at ideone or here:
#include <iostream>
#include <vector>
#include <map>
#include <algorithm>
#include <functional>
#include <random>

constexpr int SrcSize = 1000000;
constexpr int NQueries = 100000;

using src_vec_t = std::vector<int>;
using index_vec_t = std::vector<int>;
using weight_vec_t = std::vector<int>;
using pair_vec_t = std::vector<std::pair<int, int>>;
using index_map_t = std::map<int, index_vec_t>;
using interval_t = std::pair<int, int>;
using interval_vec_t = std::vector<interval_t>;
using small_map_t = std::vector<std::pair<int, int>>;
using query_vec_t = std::vector<small_map_t>;

constexpr int None = -1;
constexpr int Junk = -2;

src_vec_t generate_e()
{ // good query length = 3
    src_vec_t src;
    std::random_device rd;
    std::default_random_engine eng{rd()};
    auto exp = std::bind(std::exponential_distribution<>{0.4}, eng);
    for (int i = 0; i < SrcSize; ++i)
    {
        int x = exp();
        src.push_back(x);
        //std::cout << x << ' ';
    }
    return src;
}

src_vec_t generate_ep()
{ // good query length = 500
    src_vec_t src;
    std::random_device rd;
    std::default_random_engine eng{rd()};
    auto exp = std::bind(std::exponential_distribution<>{0.4}, eng);
    auto poisson = std::bind(std::poisson_distribution<int>{100}, eng);
    while (int(src.size()) < SrcSize)
    {
        int x = exp();
        int n = poisson();
        for (int i = 0; i < n; ++i)
        {
            src.push_back(x);
            //std::cout << x << ' ';
        }
    }
    return src;
}

src_vec_t generate()
{
    //return generate_e();
    return generate_ep();
}

int trivial(const src_vec_t& src, interval_t qi)
{
    int count = 0;
    int majorityElement = 0; // will be assigned before use for valid args
    for (int i = qi.first; i <= qi.second; ++i)
    {
        if (count == 0)
            majorityElement = src[i];
        if (src[i] == majorityElement)
            ++count;
        else
            --count;
    }
    count = 0;
    for (int i = qi.first; i <= qi.second; ++i)
    {
        if (src[i] == majorityElement)
            count++;
    }
    if (2 * count > qi.second + 1 - qi.first)
        return majorityElement;
    else
        return None;
}

index_map_t sort_ind(const src_vec_t& src)
{
    int ind = 0;
    index_map_t im;
    for (auto x: src)
        im[x].push_back(ind++);
    return im;
}

weight_vec_t get_weights(const index_vec_t& indexes)
{
    weight_vec_t weights;
    for (int i = 0; i != int(indexes.size()); ++i)
        weights.push_back(2 * i - indexes[i]);
    return weights;
}

pair_vec_t get_prefix(const index_vec_t& indexes, const weight_vec_t& weights)
{
    pair_vec_t prefix;
    for (int i = 0; i != int(indexes.size()); ++i)
        if (prefix.empty() || weights[i] < prefix.back().second)
            prefix.emplace_back(indexes[i], weights[i]);
    return prefix;
}

pair_vec_t get_suffix(const index_vec_t& indexes, const weight_vec_t& weights)
{
    pair_vec_t suffix;
    for (int i = indexes.size() - 1; i >= 0; --i)
        if (suffix.empty() || weights[i] > suffix.back().second)
            suffix.emplace_back(indexes[i], weights[i]);
    std::reverse(suffix.begin(), suffix.end());
    return suffix;
}

interval_vec_t get_intervals(const pair_vec_t& prefix, const pair_vec_t& suffix)
{
    interval_vec_t intervals;
    int prev_suffix_index = 0; // will be assigned before use for correct args
    int prev_suffix_weight = 0; // same assumptions
    for (int ind_pref = 0, ind_suff = 0; ind_pref != int(prefix.size());)
    {
        auto i_pref = prefix[ind_pref].first;
        auto w_pref = prefix[ind_pref].second;
        if (ind_suff != int(suffix.size()))
        {
            auto i_suff = suffix[ind_suff].first;
            auto w_suff = suffix[ind_suff].second;
            if (w_pref <= w_suff)
            {
                auto beg = std::max(0, i_pref + w_pref - w_suff);
                if (i_pref < i_suff)
                    intervals.emplace_back(beg, i_suff + 1);
                if (w_pref == w_suff)
                    ++ind_pref;
                ++ind_suff;
                prev_suffix_index = i_suff;
                prev_suffix_weight = w_suff;
                continue;
            }
        }
        // ind_suff out of bounds or w_pref > w_suff:
        auto end = prev_suffix_index + prev_suffix_weight - w_pref + 1;
        // end may be out-of-bounds; that's OK if overflow is not possible
        intervals.emplace_back(i_pref, end);
        ++ind_pref;
    }
    return intervals;
}

interval_vec_t merge(const interval_vec_t& from)
{
    using endpoints_t = std::vector<std::pair<int, bool>>;
    endpoints_t ep(2 * from.size());
    std::transform(from.begin(), from.end(), ep.begin(),
                   [](interval_t x){ return std::make_pair(x.first, true); });
    std::transform(from.begin(), from.end(), ep.begin() + from.size(),
                   [](interval_t x){ return std::make_pair(x.second, false); });
    std::sort(ep.begin(), ep.end());
    interval_vec_t to;
    int start; // will be assigned before use for correct args
    int overlaps = 0;
    for (auto& x: ep)
    {
        if (x.second) // begin
        {
            if (overlaps++ == 0)
                start = x.first;
        }
        else // end
        {
            if (--overlaps == 0)
                to.emplace_back(start, x.first);
        }
    }
    return to;
}

interval_vec_t get_intervals(const index_vec_t& indexes)
{
    auto weights = get_weights(indexes);
    auto prefix = get_prefix(indexes, weights);
    auto suffix = get_suffix(indexes, weights);
    auto intervals = get_intervals(prefix, suffix);
    return merge(intervals);
}

void update_qv(
    query_vec_t& qv,
    int value,
    const interval_vec_t& intervals,
    const index_vec_t& iv)
{
    int iv_ind = 0;
    int qv_ind = 0;
    int accum = 0;
    for (auto& interval: intervals)
    {
        int i_begin = interval.first;
        int i_end = std::min<int>(interval.second, qv.size() - 1);
        while (iv[iv_ind] < i_begin)
        {
            ++accum;
            ++iv_ind;
        }
        qv_ind = std::max(qv_ind, i_begin);
        while (qv_ind <= i_end)
        {
            qv[qv_ind].emplace_back(value, accum);
            if (iv[iv_ind] == qv_ind)
            {
                ++accum;
                ++iv_ind;
            }
            ++qv_ind;
        }
    }
}

void print_preprocess_stat(const index_map_t& im, const query_vec_t& qv)
{
    double sum_coverage = 0.;
    int max_coverage = 0;
    for (auto& x: qv)
    {
        sum_coverage += x.size();
        max_coverage = std::max<int>(max_coverage, x.size());
    }
    std::cout << " size = " << qv.size() - 1 << '\n';
    std::cout << " values = " << im.size() << '\n';
    std::cout << " max coverage = " << max_coverage << '\n';
    std::cout << " avg coverage = " << sum_coverage / qv.size() << '\n';
}

query_vec_t preprocess(const src_vec_t& src)
{
    query_vec_t qv(src.size() + 1);
    auto im = sort_ind(src);
    for (auto& val: im)
    {
        auto intervals = get_intervals(val.second);
        update_qv(qv, val.first, intervals, val.second);
    }
    print_preprocess_stat(im, qv);
    return qv;
}

int do_query(const src_vec_t& src, const query_vec_t& qv, interval_t qi)
{
    if (qi.first == qi.second)
        return src[qi.first];
    auto b = qv[qi.first].begin();
    auto e = qv[qi.second + 1].begin();
    while (b != qv[qi.first].end() && e != qv[qi.second + 1].end())
    {
        if (b->first < e->first)
        {
            ++b;
        }
        else if (e->first < b->first)
        {
            ++e;
        }
        else // if (e->first == b->first)
        {
            // hope this doesn't overflow
            if (2 * (e->second - b->second) > qi.second + 1 - qi.first)
                return b->first;
            ++b;
            ++e;
        }
    }
    return None;
}

int main()
{
    std::random_device rd;
    std::default_random_engine eng{rd()};
    auto poisson = std::bind(std::poisson_distribution<int>{500}, eng);
    int majority = 0;
    int nonzero = 0;
    int failed = 0;
    auto src = generate();
    auto qv = preprocess(src);
    for (int i = 0; i < NQueries; ++i)
    {
        int size = poisson();
        auto ud = std::uniform_int_distribution<int>(0, src.size() - size - 1);
        int start = ud(eng);
        int stop = start + size;
        auto res1 = do_query(src, qv, {start, stop});
        auto res2 = trivial(src, {start, stop});
        //std::cout << size << ": " << res1 << ' ' << res2 << '\n';
        if (res2 != res1)
            ++failed;
        if (res2 != None)
        {
            ++majority;
            if (res2 != 0)
                ++nonzero;
        }
    }
    std::cout << "majority elements = " << 100. * majority / NQueries << "%\n";
    std::cout << " nonzero elements = " << 100. * nonzero / NQueries << "%\n";
    std::cout << " queries = " << NQueries << '\n';
    std::cout << " failed = " << failed << '\n';
    return 0;
}
Related work:
As pointed out in another answer to this question, there is other work where this problem is already solved: "Range majority in constant time and linear space" by S. Durocher, M. He, I. Munro, P.K. Nicholson, M. Skala.
The algorithm presented in that paper has better asymptotic complexities: O(1) instead of O(log n) for query time, and O(n) instead of O(n log n) for space.
Better space complexity allows this algorithm to process larger data sets (comparing to the algorithm proposed in this answer). Less memory needed for preprocessed data and more regular data access pattern, most likely, allow this algorithm to preprocess data more quickly. But it is not so easy with query time...
Let's suppose we have input data most favorable to the algorithm from the paper: n=1000000000 (it's hard to imagine a system with more than 10..30 gigabytes of memory, in year 2013).
The algorithm proposed in this answer needs to process up to 120 (2 query boundaries * 2 * log n) elements for each query. But it performs very simple operations, similar to linear search, and it sequentially accesses two contiguous memory areas, so it is cache-friendly.
The algorithm from the paper needs to perform up to 20 operations (2 query boundaries * 5 candidates * 2 wavelet tree levels) for each query. This is 6 times less. But each operation is more complex: each query to the succinct representation of bit counters itself contains a linear search (which means 20 linear searches instead of one). Worst of all, each such operation should access several independent memory areas (unless query size and therefore quadruple size is very small), so the query is cache-unfriendly. This means each query (while a constant-time operation) is pretty slow, probably slower than in the algorithm proposed here. If we decrease the input array size, the chances increase that the algorithm proposed here is quicker.
The practical disadvantage of the algorithm in the paper is the wavelet tree and succinct bit counter implementation. Implementing them from scratch may be pretty time consuming, and using a pre-existing implementation is not always convenient.
the trick
When looking for a majority element, you may discard intervals that do not have a majority element. See Find the majority element in array. This allows you to solve this quite simply.
preparation
At preparation time, recursively keep dividing the array into two halves and store these array intervals in a binary tree. For each node, count the occurrence of each element in the array interval. You need a data structure that offers O(1) inserts and reads. I suggest using a std::unordered_multiset, which on average behaves as needed (but worst case inserts are linear). Also check if the interval has a majority element and store it if it does.
runtime
At runtime, when asked to compute the majority element for a range, dive into the tree to compute the set of intervals that covers the given range exactly. Use the trick to combine these intervals.
If we have array interval 7 5 5 7 7 7, with majority element 7, we can split off and discard 5 5 7 7 since it has no majority element. Effectively the fives have gobbled up two of the sevens. What's left is an array 7 7, or 2x7. Call this number 2 the majority count of the majority element 7:
The majority count of a majority element of an array interval is the
occurrence count of the majority element minus the combined occurrence
of all other elements.
Use the following rules to combine intervals to find the potential majority element:
Discard the intervals that have no majority element
Combining two arrays with the same majority element is easy, just add up the element's majority counts. 2x7 and 3x7 become 5x7
When combining two arrays with different majority elements, the higher majority count wins. Subtract the lower majority count from the higher to find the resulting majority count. 3x7 and 2x3 become 1x7.
If their majority elements are different but have equal majority counts, disregard both arrays. 3x7 and 3x5 cancel each other out.
When all intervals have been either discarded or combined, you are either left with nothing, in which case there is no majority element. Or you have one combined interval containing a potential majority element. Lookup and add this element's occurrence counts in all array intervals (also the previously discarded ones) to check if it really is the majority element.
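In code, the combination rules above might look like this minimal sketch (each interval is summarized as a {majority count, element} pair, with count 0 meaning "no majority element"; the names are made up):
#include <utility>
using Summary = std::pair<int, int>; // {majority count, majority element}
Summary combine(Summary a, Summary b) {
    if (a.first == 0) return b;               // rule 1: a has no majority, discard it
    if (b.first == 0) return a;               // rule 1: b has no majority, discard it
    if (a.second == b.second)
        return {a.first + b.first, a.second}; // rule 2: same element, add counts
    if (a.first == b.first)
        return {0, 0};                        // rule 4: equal counts cancel out
    return a.first > b.first                  // rule 3: higher majority count wins
        ? Summary{a.first - b.first, a.second}
        : Summary{b.first - a.first, b.second};
}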
example
For the array 1,1,1,2,2,3,3,2,2,2,3,2,2, you get the tree (majority count x majority element listed in brackets)
                              1,1,1,2,2,3,3,2,2,2,3,2,2
                                        (1x2)
                                /                   \
                 1,1,1,2,2,3,3                        2,2,2,3,2,2
                                                         (4x2)
                /              \                      /         \
        1,1,1,2                 2,3,3           2,2,2             3,2,2
         (2x1)                  (1x3)           (3x2)             (1x2)
       /      \                /    \           /    \           /    \
    1,1         1,2         2,3      3       2,2      2       3,2      2
   (2x1)                            (1x3)   (2x2)    (1x2)            (1x2)
   /   \       /   \       /   \            /   \             /   \
  1     1     1     2     2     3          2     2           3     2
(1x1) (1x1) (1x1) (1x2) (1x2) (1x3)      (1x2) (1x2)       (1x3) (1x2)
Range [5,10] (1-indexed) is covered by the set of intervals 2,3,3 (1x3), 2,2,2 (3x2). They have different majority elements. Subtract their majority counts, you're left with 2x2. So 2 is the potential majority element. Lookup and sum the actual occurrence counts of 2 in the arrays: 1+3 = 4 out of 6. 2 is the majority element.
Range [1,10] is covered by the set of intervals 1,1,1,2,2,3,3 (no majority element) and 2,2,2 (3x2). Disregard the first interval since it has no majority element, so 2 is the potential majority element. Sum the occurrence counts of 2 in all intervals: 2+3 = 5 out of 10. There is no majority element.
Actually, it can be done in constant time and linear space(!)
See https://cs.stackexchange.com/questions/16671/range-majority-queries-most-freqent-element-in-range and S. Durocher, M. He, I Munro, P.K. Nicholson, M. Skala, Range majority in constant time and linear space, Information and Computation 222 (2013) 169–179, Elsevier.
Their preparation time is O(n log n), the space needed is O(n) and queries are O(1). It is a theoretical paper and I don't claim to understand all of it but it seems far from impossible to implement. They're using wavelet trees.
For an implementation of wavelet trees, see https://github.com/fclaude/libcds
If you have unlimited memory and a limited data range (like short int), you can do it even in O(N) time.
Go through the array and count the number of 1s, 2s, 3s, etc. (the number of entries for each value you have in the array). You will need an additional array X with one counter per possible value of YourType for this.
Go through array X and find the maximum.
In total O(1) + O(N) operations.
Also you can limit yourself to O(N) memory if you use a map instead of array X.
But then you will need to find an element on each iteration at stage 1. Therefore you will need O(N*log(N)) time in total.
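A minimal sketch of stages 1 and 2 above, assuming the values fit in unsigned short (note this finds the majority candidate of the whole array, as described):
#include <algorithm>
#include <vector>
int mode_of(const std::vector<unsigned short>& src) {
    std::vector<int> X(1 << 16, 0);              // one counter per possible value
    for (unsigned short v : src) ++X[v];         // stage 1: count occurrences
    return (int)(std::max_element(X.begin(), X.end()) - X.begin()); // stage 2
}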
You can use a MAX heap, with the frequency of a number as the deciding factor for keeping the max-heap property.
For example, for the following input array
1 5 2 7 7 7 8 4 6 5
the heap would have all distinct elements with their frequencies associated with them:
Element = 1 Frequency = 1,
Element = 5 Frequency = 2,
Element = 2 Frequency = 1,
Element = 7 Frequency = 3,
Element = 8 Frequency = 1,
Element = 4 Frequency = 1,
Element = 6 Frequency = 1
As it is a MAX heap, element 7 with frequency 3 would be at the root level.
Just check whether the input range contains this element; if yes, then this is the answer. If no, then go to the left subtree or right subtree as per the input range and perform the same checks.
O(N) would be required only once while creating the heap, but once it is created, searching will be efficient.
Edit: Sorry, I was solving a different problem.
Sort the array and build an ordered list of pairs (value, number_of_occurrences) - it's O(N log N). Starting with
1 5 2 7 7 7 8 4 6
it will be
(1,1) (2,1) (4,1) (5,1) (6,1) (7,3) (8,1)
On top of this array, build a binary tree with pairs (best_value_or_none, max_occurrences). It will look like:
(1,1) (2,1) (4,1) (5,1) (6,1) (7,3) (8,1)
   \  /        \  /        \  /       |
  (0,1)       (0,1)       (7,3)     (8,1)
      \       /                \     /
        (0,1)                  (7,3)
            \                  /
                    (7,3)
This structure definitely has a fancy name, but I don't remember it :)
From here, it's O(log N) to fetch the mode of any interval. Any interval can be split into O(log N) precomputed intervals; for example:
[4, 7] = [4, 5] + [6, 7]
f([4,5]) = (0,1)
f([6,7]) = (7,3)
and the result is (7,3).
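For reference, a sketch (names assumed) of the node-combining rule implied by the tree above: keep the pair with more occurrences; a tie between different values has no unique best value (encoded here as 0, as in the diagram):
#include <utility>
using Node = std::pair<int, int>; // (best_value_or_none, max_occurrences)
Node combine(Node l, Node r) {
    if (l.second > r.second) return l;
    if (r.second > l.second) return r;
    return {l.first == r.first ? l.first : 0, l.second}; // tie: keep count, drop value
}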