Looking for an efficient data structure to do quick searches - C++

I have a list of around 1000 elements. Each element (objects that I read from a file, hence I can arrange them efficiently at the beginning) contains 4 variables. So now I am doing the following, which is very inefficient in the grand scheme of things:
void func(double value1, double value2, double value3)
{
    // fooArr holds the 1000 elements read from the file
    for (int i = 0; i < 1000; ++i)
    {
        // they are all numeric! ranges are < 1000
        if (fooArr[i].a == value1
            && fooArr[i].b >= value2
            && fooArr[i].c <= value2 // yes, value2 again
            && fooArr[i].d <= value3)
        {
            /* yay, found -- now do something! */
        }
    }
}
Space is not too important!

If space isn't too important, the easiest thing to do is to create a hash table keyed on a. Depending on how many collisions you get on a, it may make sense to make each node in the hash table point to a binary tree keyed on b; if b also has many collisions, do the same for c.
That first index into the hash table will, depending on how many collisions there are, save you a lot of time for very little coding or data-structure work.
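For illustration, here is a minimal sketch of that first-level index. It assumes a struct foo with the four numeric members a, b, c, d from the question; std::multimap stands in for the hash table (lookup is O(log n) rather than O(1), but the structure of the solution is the same):

#include <map>
#include <utility>

struct foo { double a, b, c, d; };

foo fooArr[1000];
std::multimap<double, foo*> byA; // key: the a value

void buildIndex() {
    for (int i = 0; i < 1000; ++i)
        byA.insert(std::make_pair(fooArr[i].a, &fooArr[i]));
}

void func(double value1, double value2, double value3) {
    typedef std::multimap<double, foo*>::iterator It;
    std::pair<It, It> range = byA.equal_range(value1); // only entries with a == value1
    for (It it = range.first; it != range.second; ++it) {
        foo* f = it->second;
        if (f->b >= value2 && f->c <= value2 && f->d <= value3) {
            /* yay, found -- now do something! */
        }
    }
}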

First, sort the list by increasing a and decreasing b. Then build an index on a (the values are integers from 0 to 999), so we've got
int a_index[1001]; // contains the starting subscript for each value of a
a_index[1000] = 1000;
for (i = a_index[value1]; i < a_index[value1 + 1] && fooArr[i].b >= value2; ++i)
{
    if (fooArr[i].c <= value2 && fooArr[i].d <= value3)
        /* do stuff */
}
Assuming I haven't made a mistake here, this limits the search to the subscripts where a and b are valid, which is likely to cut your search times drastically.
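A sketch of how that index itself might be built (my illustration, not the answer's code; it assumes the same foo struct as the question, with a holding integer values 0..999, and the comparator implements the "increasing a, decreasing b" order described above):

#include <algorithm>

struct foo { double a, b, c, d; };

foo fooArr[1000];
int a_index[1001];

bool order(const foo& l, const foo& r) {
    if (l.a != r.a) return l.a < r.a; // increasing a
    return l.b > r.b;                 // decreasing b within equal a
}

void buildIndex(int n) {
    std::sort(fooArr, fooArr + n, order);
    int next = 0;
    for (int v = 0; v <= 1000; ++v) {
        // skip all elements whose a is smaller than v
        while (next < n && fooArr[next].a < v) ++next;
        a_index[v] = next; // first subscript whose a is >= v
    }
}

With n = 1000 the loop naturally leaves a_index[1000] == 1000, matching the assignment in the answer.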

Since you have only three properties to match, you could use a hash table. When performing a search, you use the hash table (which indexes the a property) to find all entries where a matches SomeConstant. After that you check whether b and c also match your constants. This way you reduce the number of comparisons; I think this would speed the search up quite a bit.
Other than that, you could build three binary search trees, one sorted by each property. After searching all three of them, you perform your action for those elements which match your values in each tree.

Based on what you've said (in both the question and the comments), there are only a very few distinct values of a (something like 10).
That being the case, I'd build an index on the values of a where each one points directly to all the elements in the fooArr with that value of a:
std::vector<std::vector<foo *> > index(num_a_values);

for (int i = 0; i < 1000; i++)
    index[fooArr[i].a].push_back(&fooArr[i]);
Then when you get a value to look up an item, you go directly to those for which fooArr[i].a==value1:
std::vector<foo *> const &values = index[value1];

for (size_t i = 0; i < values.size(); i++) {
    if (value2 <= values[i]->b
        && value2 >= values[i]->c
        && value3 >= values[i]->d) {
        // yay, found something
    }
}
This way, instead of looking at 1000 items in fooArr each time, you look at an average of 100 each time. If you want still more speed, the next step would be to sort the items in each vector in the index based on the value of b. This will let you find the lower bound for value2 using a binary search instead of a linear search, reducing ~50 comparisons to ~10. Since you've sorted by b, from that point onward you don't have to compare value2 to b: you know exactly where the rest of the numbers that satisfy the inequality are, so you only have to compare to c and d.
You might also consider another approach based on the limited range of the numbers: 0 to 1000 can be represented in 10 bits. Using some bit twiddling, you could combine three fields into a single 32-bit number, which would let the compiler compare all three at once instead of in three separate operations. Getting this right is a little tricky, but once you do, it could roughly triple the speed again.
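To make the bit-twiddling idea concrete, here is a hedged sketch (my illustration, not the answer's code) of testing all three inequalities with one subtraction. It assumes the values are integers in [0, 1023], so each fits in 10 bits; a spare guard bit above each field absorbs the borrow, which is the classic SWAR comparison trick:

#include <cstdint>

// Pack three 10-bit fields, leaving a guard bit above each (bits 10, 21, 32).
inline uint64_t pack3(uint64_t f0, uint64_t f1, uint64_t f2) {
    return f0 | (f1 << 11) | (f2 << 22);
}

const uint64_t GUARDS = (1ULL << 10) | (1ULL << 21) | (1ULL << 32);

// True iff every field of x is <= the corresponding field of y. Setting the
// guard bit above each field of y and subtracting x leaves that guard set
// exactly when the field produced no borrow, i.e. when y_i >= x_i.
inline bool allFieldsLE(uint64_t x, uint64_t y) {
    return (((y | GUARDS) - x) & GUARDS) == GUARDS;
}

// The three tests b >= value2, c <= value2, d <= value3 all have the form
// x_i <= y_i. Per element, precompute once:
//   itemX = pack3(0, c, d);    itemY = pack3(b, 0, 0);
// Per query, compute once:
//   qX = pack3(value2, 0, 0);  qY = pack3(0, value2, value3);
// Then the whole condition collapses to: allFieldsLE(itemX | qX, itemY | qY)

Since each 11-bit group's result stays below 2048, no borrow ever crosses into the next field, so the three comparisons remain independent.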

I think using a kd-tree would be appropriate.
If there aren't many collisions on a, then hashing/indexing on a might already resolve your problem.
Anyway, if that doesn't work, I suggest using a kd-tree:
First make a table of multiple kd-trees, indexed by a. Then implement a kd-tree for each a value with 3 dimensions in the directions b, c, d.
Then when searching, first index into the appropriate kd-tree with a, and then search that kd-tree with your limits; basically you'll do a range search.
You'll get your answer in O(L^(2/3) + m), where L is the number of elements in the appropriate kd-tree and m is the number of matching points.
Something better that I found is the range tree; this might be what you are looking for. It's fast: it answers your query in O(log^3(L) + m). (Unfortunately, I don't know much about range trees.)

Well, let's have a go.
First of all, the == operator calls for a pigeonhole approach. Since we are talking about int values in the [0, 1000] range, a simple table is good.
std::vector<Bucket1> myTable(1001, /*MAGIC_1*/); // suspense
The idea, of course, is that you will find the YourObject instances in the bucket defined by their a attribute value... nothing magic so far.
Now on the new stuff.
&& fooArr[i].b >= value2
&& fooArr[i].c <= value2 //yes again value2
&& fooArr[i].d <= value3
The use of value2 is tricky, but you said you did not care about space, right ;) ?
typedef std::vector<Bucket2> Bucket1;
/*MAGIC_1*/ <-- Bucket1(1001, /*MAGIC_2*/) // suspense ?
A Bucket1 instance will have, in its ith position, all instances of YourObject for which yourObject.c <= i <= yourObject.b.
And now, same approach with the d.
typedef std::vector< std::vector<YourObject*> > Bucket2;
/*MAGIC_2*/ <-- Bucket2(1001)
The idea is that the std::vector<YourObject*> at index i contains pointers to all instances of YourObject for which yourObject.d <= i.
Putting it all together!
class Collection
{
public:
    Collection(size_t aMaxValue, size_t bMaxValue, size_t dMaxValue);
    // prefer to use an unsigned type for unsigned values

    void Add(const YourObject& i);

    // Pred is a unary operator taking a YourObject& and returning void
    template <class Pred>
    void Apply(int value1, int value2, int value3, Pred pred);

    // Pred is a unary operator taking a const YourObject& and returning void
    template <class Pred>
    void Apply(int value1, int value2, int value3, Pred pred) const;

private:
    // A list behaves nicely with removal;
    // if you don't plan to remove, use a vector
    // and store the position within the vector
    // (NOT an iterator, because of reallocations)
    typedef std::list<YourObject> value_list;
    typedef std::vector<value_list::iterator> iterator_vector;
    typedef std::vector<iterator_vector> bc_buckets;
    typedef std::vector<bc_buckets> a_buckets;
    typedef std::vector<a_buckets> buckets_t;

    value_list m_values;
    buckets_t m_buckets;
}; // class Collection
Collection::Collection(size_t aMaxValue, size_t bMaxValue, size_t dMaxValue) :
    m_values(),
    m_buckets(aMaxValue + 1,
              a_buckets(bMaxValue + 1, bc_buckets(dMaxValue + 1)))
{
}
void Collection::Add(const YourObject& object)
{
    value_list::iterator iter = m_values.insert(m_values.end(), object);
    a_buckets& a_bucket = m_buckets[object.a];

    for (int i = object.c; i <= object.b; ++i)
    {
        bc_buckets& bc_bucket = a_bucket[i];
        // the object matches every query whose value3 is at least object.d
        for (size_t j = object.d; j < bc_bucket.size(); ++j)
        {
            bc_bucket[j].push_back(iter);
        }
    }
} // Collection::Add
template <class Pred>
void Collection::Apply(int value1, int value2, int value3, Pred pred)
{
    iterator_vector const& iterators = m_buckets[value1][value2][value3];
    BOOST_FOREACH(value_list::iterator it, iterators)
    {
        pred(*it);
    }
} // Collection::Apply<Pred>

template <class Pred>
void Collection::Apply(int value1, int value2, int value3, Pred pred) const
{
    iterator_vector const& iterators = m_buckets[value1][value2][value3];
    // Promotion from value_list::iterator to value_list::const_iterator is ok;
    // the reverse is not, which is why we store iterators
    BOOST_FOREACH(value_list::const_iterator it, iterators)
    {
        pred(*it);
    }
} // Collection::Apply<Pred>
So, admittedly, adding and removing items to that collection will cost.
Furthermore, you have (aMaxValue + 1) * (bMaxValue + 1) * (dMaxValue + 1) std::vector<value_list::iterator> instances stored, which is a lot.
However, the complexity of Collection::Apply is roughly k applications of Pred, where k is the number of items which match the parameters.
I am looking for a review there, not sure I got all the indexes right oO

If your app is already using a database, then just put them in a table and use a query to find them. I use MySQL in a few of my apps and would recommend it.

First, for each value of a, build a different table: a table num of the numbers that share that a. Then build 2 index tables, each with 1000 rows; each row holds an integer bitmask recording which numbers are involved.
For example, let's say you have these values in the array (ignoring a, because we have a table for each a value):
b = 96 46 47 27 40 82 9 67 1 15
c = 76 23 91 18 24 20 15 43 17 10
d = 44 30 61 33 21 52 36 70 98 16
Then the index-table values for rows 50 and 20 are (bit i is set iff element i qualifies):
idx[a].bc[50] = 0000010100  // elements with c <= 50 <= b
idx[a].d[50]  = 1101101001  // elements with d <= 50
idx[a].bc[20] = 0001010000
idx[a].d[20]  = 0000000001
So let's say you call func(a, 20, 50). To get which numbers are involved you do:
g = idx[a].bc[20] & idx[a].d[50];
Then g has a 1 bit for each number you have to deal with. If you don't need the array values, you can just do a population count on g and do the inner thing popCount(g) times.
You can do:
unsigned tg = g;
int n = 0;
while (tg > 0) {
    if (tg & 1) {
        // do your stuff with element n
    }
    tg >>= 1; // C++ has no >>>; with an unsigned type, >> is a logical shift
    n++;
}
Maybe the tg >>= 1; n++; part can be improved by skipping over runs of zeros, but I have no idea if that's possible. It should be considerably faster than your current approach because all the loop variables fit in registers.
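At the question's sizes (1000 elements, values up to 1000), a single machine word can't hold the mask, so here is a hedged sketch of the same scheme with std::bitset. The table layout mirrors the idx[a].bc / idx[a].d rows above; building the tables is assumed to follow the rules stated there:

#include <bitset>

const int N = 1000; // elements per a value (the question's size)

struct IndexTables {
    std::bitset<N> bc[1001]; // bc[v]: bit i set iff c[i] <= v <= b[i]
    std::bitset<N> d[1001];  // d[v]:  bit i set iff d[i] <= v
};

// func(a, value2, value3), given the tables built for that a value:
void query(const IndexTables& idx, int value2, int value3) {
    std::bitset<N> g = idx.bc[value2] & idx.d[value3];
    // g.count() gives the population count; or visit each set bit:
    for (int i = 0; i < N; ++i) {
        if (g[i]) {
            /* do your stuff with element i */
        }
    }
}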

As pmg said, the idea is to eliminate as many comparisons as possible. Obviously you won't have 4000 comparisons. That would require that all 1000 elements pass the first test, which would then be redundant. Apparently there are only 10 values of a, hence 10% passes that check. So, you'd do 1000 + 100 + ? + ? checks. Let's assume +50+25, for a total of 1175.
You'd need to know how a,b,c,d and value1, 2 and 3 are distributed to decide exactly what's fastest. We only know that a can have 10 values, and we presume that value1 has the same domain. In that case, binning by a can reduce it to an O(1) operation to get the right bin, plus the same 175 checks further on. But if b,c and value2 effectively form 50 buckets, you could find the right bucket again in O(1). Yet each bucket would now have an average of 20 elements, so you'd only need 35 tests (80% reduction). So, data distribution matters here. Once you understand your data, the algorithm will be clear.

Look, this is just a linear search. It would be nice if you could do a search that scales up better, but your complex matching requirements make it unclear to me whether it's even possible to, say, keep it sorted and use a binary search.
Having said this, perhaps one possibility is to generate some indexes. The main index might be a dictionary keyed on the a property, associating it with a list of elements with the same value for that property. Assuming the values for this property are well-distributed, it would immediately eliminate the overwhelming majority of comparisons.
If the property has a limited number of values, then you could consider adding an additional index which sorts items by b and maybe even another that sorts by c (but in the opposite order).

You can use hash_set from the SGI STL; this will give you a very efficient implementation. The complexity of your search would be O(1).
Here is the link: http://www.sgi.com/tech/stl/hash_set.html
--EDIT--
Declare a new struct which will hold your variables, overload the comparison operators, and make a hash_set of this new struct. Every time you want to search, create a new object with your variables and pass it to the hash_set's find method.
It seems that hash_set is not mandatory in the STL, so you can use set instead; it will give you O(log N) complexity for searching.
Here is an example:
#include <cstdlib>
#include <iostream>
#include <set>
using namespace std;
struct Obj{
public:
    Obj(double a, double b, double c, double d){
        this->a = a;
        this->b = b;
        this->c = c;
        this->d = d;
    }
    double a;
    double b;
    double c;
    double d;
    friend bool operator < ( const Obj &l, const Obj &r ) {
        if(l.a != r.a) return l.a < r.a;
        if(l.b != r.b) return l.b < r.b; // was "l.a < r.b", a typo
        if(l.c != r.c) return l.c < r.c;
        if(l.d != r.d) return l.d < r.d;
        return false;
    }
};
int main(int argc, char *argv[])
{
    set<Obj> A;
    A.insert( Obj(1,2,3,4));
    A.insert( Obj(16,23,36,47));
    A.insert( Obj(15,25,35,43));

    Obj c(1,2,3,4);
    A.find(c);          // returns an iterator to the matching element
    cout << A.count(c); // prints 1

    system("PAUSE");    // Windows-only pause
    return EXIT_SUCCESS;
}

Related

Using sort function to sort vector of tuples in a chained manner

So I tried sorting my list of tuples in such a manner that the next tuple's first element equals the second element of the present tuple (the first tuple being the one with the smallest first element).
(x can be anything)
unsorted
3 5 x
4 6 x
1 3 x
2 4 x
5 2 x
sorted
1 3 x
3 5 x
5 2 x
2 4 x
4 6 x
I used the following function as the third argument of std::sort:
bool myCompare(tuple<int,int,int>a,tuple<int,int,int>b){
if(get<1>(a) == get<2>(b)){
return true;
}
return false;
}
But my output was unchanged. Please help me fix the function or suggest another way.
This can't be achieved by using std::sort with a custom comparison function, because your comparison function doesn't establish a strict weak order on your elements.
The std::sort documentation states that the comparison function has to fulfill the Compare requirements, and the Compare requirements say the function has to introduce a strict weak ordering.
See https://en.wikipedia.org/wiki/Weak_ordering for the properties of a strict weak order.
Compare requirements: https://en.cppreference.com/w/cpp/named_req/Compare
The comparison function has to return true if the first argument comes before the second argument with respect to the strict weak order.
For example, the tuple a=(4, 4, x) violates the irreflexivity property comp(a, a) == false.
And a=(4, 6, x) and b=(6, 4, y) violate the asymmetry property that if comp(a, b) == true, then comp(b, a) must be false.
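Since std::sort cannot produce this chained order, here is a minimal sketch (my illustration, not part of the answer above) of building it directly: index the tuples by their first element, start at the smallest one, and repeatedly jump to the tuple whose first element equals the current tuple's second element. It assumes the first elements are unique, as in the question's example:

#include <map>
#include <tuple>
#include <vector>

std::vector<std::tuple<int,int,int>>
chainOrder(const std::vector<std::tuple<int,int,int>>& in) {
    std::vector<std::tuple<int,int,int>> out;
    if (in.empty()) return out;

    std::map<int, std::tuple<int,int,int>> byFirst; // first element -> tuple
    for (const auto& t : in) byFirst[std::get<0>(t)] = t;

    int key = byFirst.begin()->first; // smallest first element starts the chain
    for (size_t i = 0; i < in.size(); ++i) {
        auto it = byFirst.find(key);
        if (it == byFirst.end()) break;  // chain broken
        out.push_back(it->second);
        key = std::get<1>(it->second);   // the next tuple must start with this
        byFirst.erase(it);
    }
    return out;
}

For the example input this yields exactly the desired order: 1 3 x, 3 5 x, 5 2 x, 2 4 x, 4 6 x.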
I am not sure where the real problem is coming from, but the background is the cyclic permutation problem.
In your special case you are looking for a k-cycle, where k is equal to the count of tuples. I drafted a solution for you that will show all cycles (not only the desired k-cycle), and I use the notation described in the provided link. The other values of the tuple are irrelevant for the problem.
But how to implement it?
The secret is to select the correct container types. I use 2. For a cycle, I use a std::unordered_set. This can contain only unique elements, which prevents an infinite cycle: 0,1,3,0,1,3,0,1,3... is not possible, because each value can be in the container only once. That will stop our walk through the permutations; as soon as we see a number that is already in the cycle, we stop.
All found cycles will be stored in the second container type: a std::set. The std::set can also contain only unique values, and the values are ordered. Because we store complex data in the std::set, we create a custom comparator for it. We need to take care that the std::set does not contain duplicate entries, and a duplicate in our case would also be 0,1,3 versus 1,3,0. In the custom comparator, we first copy the 2 sets into std::vectors and sort the std::vectors; this turns 1,3,0 into 0,1,3, and then we can easily detect duplicates.
Please note: I only ever store values from the first permutation in a cycle; the 2nd is used as a helper to find the index of the next value to evaluate.
Please see the code below. It will produce 4 nontrivial cycles, and one has the expected number of elements: 1,3,5,2,4.
Program output:
Found Cycles:
(1,3,5,2,4)(3,5,2,4)(2,4)(5,2,4)
Please digest.
#include <iostream>
#include <vector>
#include <algorithm>
#include <unordered_set>
#include <iterator>
#include <set>
// Make reading easier and define some alias names
using MyType = int;
using Cycle = std::unordered_set<MyType>;
using Permutation = std::vector<MyType>;
using Permutations = std::vector<Permutation>;
// We do not want to have duplicate results.
// A duplicate cycle is a cycle with the same elements in a different order,
// so define a custom comparator functor for our resulting set
struct Comparator {
    bool operator () (const Cycle& lhs, const Cycle& rhs) const {
        // Convert the unordered_sets to vectors
        std::vector<MyType> v1(lhs.begin(), lhs.end());
        std::vector<MyType> v2(rhs.begin(), rhs.end());
        // Sort them
        std::sort(v1.begin(), v1.end());
        std::sort(v2.begin(), v2.end());
        // Compare them
        return v1 < v2;
    }
};

// Resulting cycles
using Cycles = std::set<Cycle, Comparator>;

int main() {
    // The source data
    Permutations perms2 = {
        {3,4,1,2,5},
        {5,6,3,4,2} };

    // Lambda to find the index of a given number in the first permutation
    auto findPos = [&perms2](const MyType& m) {
        return std::distance(perms2[0].begin(), std::find(perms2[0].begin(), perms2[0].end(), m));
    };

    // Here we will store our resulting set of cycles
    Cycles resultingCycles{};

    // Go through all single elements of the first permutation
    for (size_t currentColumn = 0U; currentColumn < perms2[0].size(); ++currentColumn) {
        // This is a temporary for a cycle that we found in this loop
        Cycle trialCycle{};
        // First value to start with
        size_t startColumn = currentColumn;
        // Follow the complete path through the 2 permutations
        for (bool insertResult{ true }; insertResult; ) {
            // Insert the found element from the first permutation into the current cycle
            const auto& [newElement, insertOk] = trialCycle.insert(perms2[0][startColumn]);
            // Find the index of the element under the first value (from the 2nd permutation)
            startColumn = findPos(perms2[1][startColumn]);
            // Check if we should continue (could we insert a further element into our current cycle?)
            insertResult = insertOk && startColumn < perms2[0].size();
        }
        // We will only consider cycles with a length > 1
        if (trialCycle.size() > 1) {
            // Store the current temporary cycle as an additional result
            resultingCycles.insert(trialCycle);
        }
    }

    // Simple output
    std::cout << "\n\nFound Cycles:\n\n";

    // Go through all found cycles
    for (const Cycle& c : resultingCycles) {
        // Print an opening brace
        std::cout << "(";
        // Handle the comma delimiter
        std::string delimiter{};
        // Print all integer values of the cycle
        for (const MyType& m : c) {
            std::cout << delimiter << m;
            delimiter = ",";
        }
        std::cout << ")";
    }
    std::cout << "\n\n";

    return 0;
}

Sort std::vector<int> but ignore a certain number

I have an std::vector<int> of the size 10 and each entry is initially -1. This vector represents a leaderboard for my game (high scores), and -1 just means there is no score for that entry.
std::vector<int> myVector;
myVector.resize(10, -1);
When the game is started, I want to load the high scores from a file. I load each line (up to 10 lines), convert the value found to an int with std::stoi, and if the number is > 0 I replace the -1 currently in the vector at the current position with it.
All this works. Now to the problem:
Since the values in the file aren't necessarily sorted, I want to sort myVector after I load all entries. I do this with
std::sort(myVector.begin(), myVector.end());
This sorts it in ascending order (lower score is better in my game).
The problem is that, since the vector is initially filled with -1 and there aren't necessarily 10 entries saved in the high scores file, the vector might contain a few -1 in addition to the player's scores.
That means when sorting the vector with the above code, all the -1 will appear before the player's scores.
My question is: How do I sort the vector (in ascending order), but all entries with -1 will be put at the end (since they don't represent a real score)?
Combine partitioning and sorting:
std::sort(v.begin(),
std::partition(v.begin(), v.end(), [](int n){ return n != -1; }));
If you store the iterator returned from partition, you already have a complete description of the range of non-trivial values, so you don't need to look for −1s later.
You can provide a lambda as the comparison parameter for sort:
std::sort(myVector.begin(), myVector.end(), [](int i1, int i2) {
    if (i1 == -1) return false;
    if (i2 == -1) return true;
    return i1 < i2;
});
here is the demo (copied from Kerrek)
though it is not clear how you will know which score is which after the sort.
From your description, it appears that the score can never be negative. In that case, I'd recommend making the scores a vector of unsigned int. You can define a constant
const unsigned int INFINITY = -1;
and load your vector with INFINITY initially. INFINITY is the maximum value that can be stored in a 32-bit unsigned integer (which also corresponds to -1 in 2's complement). (Note: <cmath> already defines an INFINITY macro, so choose a different name if that header is in play.)
Then you could simply sort using
sort(v.begin(),v.end());
All INFINITY will be at the end after the sort.
std::sort supports using your own comparison function with the signature bool cmp(const T& a, const T& b);. So write your own function similar to this:
bool sort_negatives(const int& a, const int& b)
{
    if (a == -1) {
        return false;
    }
    if (b == -1) {
        return true;
    }
    return a < b;
}
And then call sort like std::sort(myVector.begin(), myVector.end(), sort_negatives);.
EDIT: Fixed the logic courtesy of Slava. If you are using a compiler with C++11 support, use the lambda or partition answers, but this should work on compilers pre C++11.
For the following, I assume that the -1 values are all placed at the end of the vector. If they are not, use KerrekSB's method, or make sure that you do not skip the indices in the vector for which no valid score is in the file (by using an extra index / iterator for writing to the vector).
std::sort works on a pair of iterators, so simply provide the sub-range which contains the values other than -1. You already know the end of this range after reading from the file. If you already use iterators to fill the vector, like in
auto it = myVector.begin();
while (...) {
    *it = stoi(...);
    ++it;
}
then simply use it instead of myVector.end():
std::sort(myVector.begin(), it);
Otherwise (i.e., when using indices to fill up the vector, let's say i is the number of values), use
std::sort(myVector.begin(), myVector.begin() + i);
An alternative approach is to use reserve() instead of resize().
std::vector<int> myVector;
myVector.reserve(10);
for each line in file:
    int number_in_line = ...;
    myVector.push_back(number_in_line);
std::sort(myVector.begin(), myVector.end());
This way, the vector holds only the numbers that are actually in the file, with no extra (spurious) values (e.g. -1). If the vector needs to be passed later to another module or function for further processing, they do not need to know about the special nature of the -1 values.

Simulate random iteration of array

I have an array of a given size. I want to traverse it in pseudorandom order, keeping the array intact and visiting each element once. It will be best if the current state can be stored in a few integers.
I know you can't have full randomness without storing the full array, but I don't need the order to be really random; I need it to be perceived as random by the user. The solution should use sub-linear space.
One possible suggestion - using large prime number - is given here. The problem with this solution is that there is an obvious fixed step (taken module array size). I would prefer a solution which is not so obviously non-random. Is there a better solution?
How about this algorithm?
To pseudo-pseudo-randomly traverse an array of size n:
1. Create a small array of size k.
2. Use the large-prime-number method to fill the small array; set i = 0.
3. Randomly remove a position from the small array using an RNG; i += 1.
4. If i < n - k, add a new position using the large-prime-number method.
5. If i < n, go to step 3.
The higher k is, the more randomness you get. This approach allows you to delay generating numbers from the prime-number method.
A similar approach can be used to produce a number earlier than expected in the sequence by creating another array, a "skip list". Randomly pick items from later in the sequence, use them as the next positions to visit, and then add them to the skip list. When they naturally arrive, they are looked up in the skip list, suppressed, and removed from it, at which point you can randomly add another item to the skip list.
The idea of a random generator that simulates a shuffle is good if you can get one whose maximum period you can control.
A Linear Congruential Generator calculates a random number with the formula:
x[i + 1] = (a * x[i] + c) % m;
The maximum period is m and it is achieved when the following properties hold:
The parameters c and m are relatively prime.
For every prime number r dividing m, a - 1 is a multiple of r.
If m is a multiple of 4 then also a - 1 is multiple of 4.
My first draft involved making m the next multiple of 4 after the array length and then finding suitable a and c values. This was (a) a lot of work and (b) yielded very obvious results sometimes.
I've rethought this approach. We can make m the smallest power of two that the array length will fit in. The only prime factor of m is then 2, which will make every odd number relatively prime to it. With the exception of 1 and 2, m will be divisible by 4, which means that we must make a - 1 a multiple of 4.
Having a greater m than the array length means that we must discard all values that are illegal array indices. This will happen at most every other turn and should be negligible.
The following code yields pseudorandom numbers with a period of exactly m. I've avoided trivial values for a and c, and on my (not too numerous) spot checks, the results looked okay. At least there was no obvious cycling pattern.
So:
#include <cstdlib>

class RandomIndexer
{
public:
    RandomIndexer(size_t length) : len(length)
    {
        m = 8;
        while (m < length) m <<= 1;
        c = m / 6 + uniform(5 * m / 6);
        c |= 1;                         // odd, hence relatively prime to m
        a = m / 12 * uniform(m / 6);
        a = 4 * a + 1;                  // a - 1 is a multiple of 4
        x = uniform(m);
    }

    size_t next()
    {
        do { x = (a * x + c) % m; } while (x >= len);
        return x;
    }

private:
    static size_t uniform(size_t m)
    {
        double p = std::rand() / (1.0 + RAND_MAX);
        return static_cast<size_t>(m * p);
    }

    size_t len;
    size_t x;
    size_t a;
    size_t c;
    size_t m;
};
You can then use the generator like this:
std::vector<int> list;
for (size_t i = 0; i < 3; i++) list.push_back(i);

RandomIndexer ix(list.size());
for (size_t i = 0; i < list.size(); i++) {
    std::cout << list[ix.next()] << std::endl;
}
I am aware that this still isn't a great random number generator, but it is reasonably fast, doesn't require a copy of the array and seems to work okay.
If the approach of picking a and c randomly yields bad results, it might be a good idea to restrict the generator to some powers of two and to hard-code literature values that have proven to be good.
As pointed out by others, you can create a sort of "flight plan" upfront by shuffling an array of array indices and then following it. This violates the "it will be best if current state can be stored in a few integers" constraint, but does it really matter? Are there tight performance constraints? After all, I believe that if you don't accept repetitions, you need to store the items you have already visited somewhere, somehow.
Alternatively, you can opt for an intrusive solution and store a bool inside each element of the array, telling you whether the element was already selected or not. This can be done in an almost clean way by employing inheritance (multiple, as needed).
Many problems come with this solution, e.g. thread safety, and of course it violates the "keep the array intact" constraint.
Quadratic residues, which you have mentioned ("using a large prime"), are well known, will work, and guarantee iterating over each and every element exactly once (if that is required, though it seems that's not strictly the case here). Unluckily they are not very "random looking", and there are a few other requirements on the modulus, in addition to being prime, for the scheme to work.
There is a page on Jeff Preshing's site which describes the technique in detail and suggests feeding the output of the residue generator into the generator again with a fixed offset.
However, since you said that you merely need "perceived as random by user", it seems you might be able to get by with feeding a hash function (say, CityHash or SipHash) with consecutive integers. The output will be a "random" integer, and at least so far there will be a strict 1:1 mapping (since there are many more possible hash values than inputs).
Now the problem is that your array is most likely not that large, so you need to somehow reduce the range of the generated indices without producing duplicates (which is tough).
The obvious solution (taking the modulo) will not work, as it pretty much guarantees that you get a lot of duplicates.
Using a bitmask to limit the range to the next greater power of two should work without introducing bias, and discarding indices that are out of bounds (generating a new index) should work as well. Note that this needs non-deterministic time, but the combination of the two should work reasonably well (a couple of tries at most) on average.
Otherwise, the only solution that "really works" is shuffling an array of indices, as pointed out by Kamil Kilolajczyk (though you don't want that).
Here is a Java solution, which can be easily converted to C++; it is similar to M Oehm's solution above, albeit with a different way of choosing the LCG parameters.
import java.util.Enumeration;
import java.util.Random;

public class RandomPermuteIterator implements Enumeration<Long> {
    int c = 1013904223, a = 1664525;
    long seed, N, m, next;
    boolean hasNext = true;

    public RandomPermuteIterator(long N) throws Exception {
        if (N <= 0 || N > Math.pow(2, 62)) throw new Exception("Unsupported size: " + N);
        this.N = N;
        m = (long) Math.pow(2, Math.ceil(Math.log(N) / Math.log(2)));
        next = seed = new Random().nextInt((int) Math.min(N, Integer.MAX_VALUE));
    }

    public static void main(String[] args) throws Exception {
        RandomPermuteIterator r = new RandomPermuteIterator(100);
        while (r.hasMoreElements()) System.out.print(r.nextElement() + " ");
        // output: 50 52 3 6 45 40 26 49 92 11 80 2 4 19 86 61 65 44 27 62 5 32 82 9 84 35 38 77 72 7 ...
    }

    @Override
    public boolean hasMoreElements() {
        return hasNext;
    }

    @Override
    public Long nextElement() {
        next = (a * next + c) % m;
        while (next >= N) next = (a * next + c) % m;
        if (next == seed) hasNext = false;
        return next;
    }
}
maybe you could use this one: http://www.cplusplus.com/reference/algorithm/random_shuffle/ ?

Most frequent character in range

I have a string s of length n. What is the most efficient data structure / algorithm to use for finding the most frequent character in range i..j?
The string doesn't change over time, I just need to repeat queries that ask for the most frequent char among s[i], s[i + 1], ... , s[j].
Use an array in which you hold the number of occurrences of each character. You increase the respective value while iterating through the string once. While doing this, you can remember the current max in the array; alternatively, look for the highest value in the array at the end.
Pseudocode:
arr = [0] * 256
for (char in string)
    arr[char]++
mostFrequent = highest(arr)
Do a single iteration over the array, and for each position remember how many occurrences of each character there are up to that position. So something like this:
"abcdabc"
for index 0:
count['a'] = 1
count['b'] = 0
etc...
for index 1:
....
count['a'] = 1
count['b'] = 1
count['c'] = 0
etc...
for index 2:
....
count['a'] = 1
count['b'] = 1
count['c'] = 1
....
And so on. For index 6:
....
count['a'] = 2
count['b'] = 2
count['c'] = 2
count['d'] = 1
... all others are 0
After you compute this array you can get the number of occurrences of a given letter in an interval (i, j) in constant time: simply compute count[j] - count[i-1] (be careful with i = 0!).
So for each query you will have to iterate over all letters, not over all characters in the interval; thus instead of iterating over 10^6 characters you only pass over at most 128 (assuming you have only ASCII symbols).
A drawback: you need more memory, depending on the size of the alphabet you are using.
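A sketch of this idea in C++ (my illustration, assuming plain ASCII input, 128 symbols). It uses one extra row so the i = 0 case needs no special-casing: count[k][ch] holds the occurrences of ch in the first k characters, and a query only scans the alphabet:

#include <string>
#include <vector>

struct RangeMode {
    std::vector<std::vector<int> > count; // count[k][ch]: occurrences of ch in s[0..k-1]

    explicit RangeMode(const std::string& s)
        : count(s.size() + 1, std::vector<int>(128, 0)) {
        for (size_t k = 0; k < s.size(); ++k) {
            count[k + 1] = count[k];                  // copy the previous prefix row
            ++count[k + 1][(unsigned char)s[k] & 0x7F];
        }
    }

    // Most frequent character in s[i..j], inclusive.
    char query(size_t i, size_t j) const {
        int best = -1;
        char bestCh = 0;
        for (int ch = 0; ch < 128; ++ch) {
            int n = count[j + 1][ch] - count[i][ch];
            if (n > best) { best = n; bestCh = (char)ch; }
        }
        return bestCh;
    }
};

Building the table costs O(128 n) time and memory because each prefix row is copied, which is exactly the memory drawback mentioned above.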
If you wish to get efficient results on intervals, you can build an integral distribution vector at each index of your sequence. Then by subtracting the integral distribution at i from the one at j+1 you can obtain the distribution on the interval s[i], s[i+1], ..., s[j].
Some pseudocode in Python follows. I assume your characters are chars, hence 256 distribution entries.
def buildIntegralDistributions(s):
    IDs=[]   # integral distributions
    D=[0]*256
    IDs.append(D[:])
    for x in s:
        D[ord(x)]+=1
        IDs.append(D[:])
    return IDs

def getIntervalDistribution(IDs, i, j):
    D=[0]*256
    for k in range(256):
        D[k]=IDs[j][k]-IDs[i][k]
    return D

s='abababbbb'
IDs=buildIntegralDistributions(s)
Dij=getIntervalDistribution(IDs, 2, 4)
>>> s[2:4]
'ab'
>>> Dij[ord('a')] # how many 'a'-s in s[2:4]?
1
>>> Dij[ord('b')] # how many 'b'-s in s[2:4]?
1
You need to specify your algorithmic requirements in terms of space and time complexity.
If you insist on O(1) space complexity, just sorting (e.g. using lexicographic ordering of bits if there is no natural comparison operator available) and counting the number of occurrences of the highest element will give you O(N log N) time complexity.
If you insist on O(N) time complexity, use @Luchian Grigore's solution, which also takes O(N) space complexity (well, O(K) for a K-letter alphabet).
string = "something"
arrCount[string.length()];
After each access to the string, call freq():
freq(char accessedChar){
    arrCount[string.indexOf(accessedChar)] += 1
}
To get the most frequent char, call string.charAt(arrCount.max()).
This assumes the string is constant, and that different i and j will be passed to query occurrences.
If you want to minimize processing time you can make a

struct occurences{
    char c;
    std::list<int> positions;
};

and keep a std::list<occurences> with one entry per character; for fast searching you can keep the positions ordered.
And if you want to minimize memory, you can just keep an incrementing counter and loop through i..j.
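A sketch of this position-list idea (my illustration): since each character's positions are appended in string order, the lists are already sorted, and counting the occurrences of a character inside [i, j] becomes two binary searches:

#include <algorithm>
#include <string>
#include <vector>

// pos[c] holds all indices at which character c occurs, in ascending order.
std::vector<std::vector<int> > buildPositions(const std::string& s) {
    std::vector<std::vector<int> > pos(256);
    for (int k = 0; k < (int)s.size(); ++k)
        pos[(unsigned char)s[k]].push_back(k);
    return pos;
}

// Number of occurrences of the character with position list p in s[i..j].
int countInRange(const std::vector<int>& p, int i, int j) {
    return (int)(std::upper_bound(p.begin(), p.end(), j) -
                 std::lower_bound(p.begin(), p.end(), i));
}

Total memory is O(n) regardless of alphabet size, and a query over every character costs O(alphabet * log n).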
The most time-efficient algorithm, as has been suggested, is to store the frequencies of each character in an array. Note, however, that if you simply index the array with the characters, you may invoke undefined behaviour. Namely, if you are processing text that contains code points outside of the range 0x00-0x7F, such as text encoded with UTF-8, you may end up with a segmentation violation at best, and stack data corruption at worst:
char frequencies [256] = {};
frequencies ['á'] = 9; // Oops. If our implementation represents char using a
                       // signed eight-bit integer, we just referenced memory
                       // outside of our array bounds!
A solution that properly accounts for this would look something like the following:
#include <limits>
#include <string>

using namespace std;

template <typename charT>
charT most_frequent (const basic_string <charT>& str)
{
    constexpr auto charT_max = numeric_limits <charT>::max ();
    constexpr auto charT_min = numeric_limits <charT>::lowest ();
    size_t frequencies [charT_max - charT_min + 1] = {};

    for (auto c : str)
        ++frequencies [c - charT_min];

    charT most_frequent;
    size_t count = 0;
    for (charT c = charT_min; c < charT_max; ++c)
        if (frequencies [c - charT_min] > count)
        {
            most_frequent = c;
            count = frequencies [c - charT_min];
        }

    // We have to check charT_max outside of the loop,
    // as otherwise it will probably never terminate
    if (frequencies [charT_max - charT_min] > count)
        return charT_max;

    return most_frequent;
}
If you want to iterate over the same string multiple times, modify the above algorithm (as construct_array) to use a std::array <size_t, numeric_limits <charT>::max () - numeric_limits <charT>::lowest () + 1>. Then return that array instead of the max character after the first for loop and omit the part of the algorithm that finds the most frequent character. Construct a std::map <std::string, std::array <...>> in your top-level code and store the returned array in that. Then move the code for finding the most frequent character into that top-level code and use the cached count array:
char most_frequent (string s)
{
    static map <string, array <...>> cache;
    if (cache.count (s) == 0)
        cache [s] = construct_array (s);
    // find the most frequent character, as above, replacing `frequencies`
    // with cache [s], then return it
}
Now, this only works for whole strings. If you want to process relatively small substrings repeatedly, you should use the first version instead. Otherwise, I'd say that your best bet is probably to do something like the second solution, but partitioning the string into manageable chunks; that way, you can fetch most of the information from your cache, only having to recalculate the frequencies in the chunks in which your iterators reside.
The fastest would be to use an unordered_map or similar:
pair<char, int> fast(const string& s) {
    unordered_map<char, int> result;
    for (const auto i : s) ++result[i];
    return *max_element(cbegin(result), cend(result),
                        [](const auto& lhs, const auto& rhs) { return lhs.second < rhs.second; });
}
The lightest, memory-wise, would require a non-constant input which could be sorted, such that find_first_not_of or similar could be used:
pair<char, int> light(string& s) {
    pair<char, int> result;
    int start = 0;
    sort(begin(s), end(s));
    for (auto finish = s.find_first_not_of(s.front()); finish != string::npos;
         start = finish, finish = s.find_first_not_of(s[start], start))
        if (const int second = finish - start; second > result.second)
            result = make_pair(s[start], second);
    if (const int second = size(s) - start; second > result.second)
        result = make_pair(s[start], second);
    return result;
}
It should be noted that both of these functions have the precondition of a non-empty string. Also, if there is a tie for the most frequent character, both functions will return the character that is lexicographically first.

Finding which bin a values fall into

I'm trying to find which category C a double x belongs to.
My categories are defined as strings names and doubles values in a file like this
A 1.0
B 2.5
C 7.0
which should be interpreted like this:
"A": 0 < x <= 1.0
"B": 1.0 < x <= 2.5
"C": 2.5 < x <= 7.0
(the input can have arbitrary length and may first have to be sorted by value). I simply need a function like this
std::string findCategory(categories_t categories, double x) {
...insert magic here
}
so for this example I'd expect
findCategory(categories, 0.5) == "A"
findCategory(categories, 1.9) == "B"
findCategory(categories, 6.0) == "C"
So my question is a) how to write the function and b) what the best choice of category_t may be (using stl in pre 11 C++). I made several attempts, all of which were... less than successful.
One option would be to use the std::map container with doubles as keys and values corresponding to what value is assigned to the range whose upper endpoint is the given value. For example, given your file, you would have a map like this:
std::map<double, std::string> lookup;
lookup[1.0] = "A";
lookup[2.5] = "B";
lookup[7.0] = "C";
Then, you could use the std::map::lower_bound function, given some point, to get back the key/value pair whose key (upper endpoint) is the first key in the map that is at least as large as the point in question. For example, with the above map, lookup.lower_bound(1.37) would return an iterator whose value is "B." lookup.lower_bound(2.56) would return an iterator whose value is "C." These lookups are fast; they take O(log n) time for a map with n elements.
In the above, I'm assuming that the values you're looking up are all nonnegative. If negative values are allowed, you can add a quick test in to check whether the value is negative before you do any lookups. That way, you can eliminate spurious results.
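As a concrete illustration (a minimal sketch, not from the answer itself, written in the pre-C++11 style the question asks for), the whole findCategory function then collapses to a single lower_bound call; here categories_t is assumed to be the std::map<double, std::string> just described:

#include <map>
#include <string>

typedef std::map<double, std::string> categories_t;

// Returns the category whose upper endpoint is the first one >= x,
// or an empty string if x is above the last endpoint.
std::string findCategory(const categories_t& categories, double x) {
    categories_t::const_iterator it = categories.lower_bound(x);
    if (it == categories.end())
        return "";
    return it->second;
}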
For what it's worth, if you happen to know something about the distribution of your lookups (say, they're uniformly distributed) it's possible to build a special data structure called an optimal binary search tree that will give better access times than the std::map. Also, depending on your application, there may be even faster options available. For example, if you're doing this because you want to randomly choose one of the outcomes with differing probabilities, then I would suggest looking into this article on the alias method, which lets you generate random values in O(1) time.
Hope this helps!
You can use the pair type and lower_bound from <algorithm>
(http://www.cplusplus.com/reference/algorithm/lower_bound/).
Let's define your categories in terms of their upper edges:
typedef pair<double, string> category_t;
Then just make a vector of those edges and search it using binary search. See the full example below.
#include <string>
#include <vector>
#include <algorithm>
#include <iostream>
using namespace std;
typedef pair<double,string> category_t;
std::string findCategory(const vector<category_t> &categories, double x) {
    vector<category_t>::const_iterator it =
        std::lower_bound(categories.begin(), categories.end(), category_t(x, ""));
    if (it == categories.end()) {
        return "";
    }
    return it->second;
}

int main (){
    vector< category_t > edges;
    edges.push_back(category_t(0,   "bin n with upper edge at 0 (underflow)"));
    edges.push_back(category_t(1,   "bin A with upper edge at 1"));
    edges.push_back(category_t(2.5, "bin B with upper edge at 2.5"));
    edges.push_back(category_t(7,   "bin C with upper edge at 7"));
    edges.push_back(category_t(8,   "bin D with upper edge at 8"));
    edges.push_back(category_t(9,   "bin E with upper edge at 9"));
    edges.push_back(category_t(10,  "bin F with upper edge at 10"));

    vector< double > examples;
    examples.push_back(1);
    examples.push_back(3.3);
    examples.push_back(7.4);
    examples.push_back(-5);
    examples.push_back(15);

    for (vector< double >::const_iterator eit = examples.begin(); eit != examples.end(); ++eit)
        cout << "value " << *eit << " : " << findCategory(edges, *eit) << endl;
}
The comparison works the way we want it to since the double comes first in the pair, and pairs compare first by their first component and then by their second. Otherwise we would define a compare predicate, as described on the page linked above.