How to efficiently gather the repeating elements in a given array? - c++

I'd like to gather the duplicates in a given array. For example, i have an array like this:
{1,5,3,1,5,6,3}
and i want the result to be:
{3,3,1,1,5,5,6}
In my case, the number of cluster is unknowen before calculation, and the order is not concerned.
I achived this by using the bult-in function Sort in C++. However, actually the ordering is not necessary. Hence, i guess there are probably more efficient methods to accomplish it.
Thanks in advance.

First, construct a histogram noting frequencies of each number. You can use a dictionary to accomplish this in O(n) time and space.
Next, loop over the dictionary's keys (order is unimportant here) and for each one, write a number of instances of that key equal to the corresponding value.
Example:
{1,5,3,1,5,6,3} input
{1->2,5->2,3->2,6->1} histogram dictionary
{1,1,5,5,3,3,6} wrote two 1s, two 5s, two 3s, then one 6
This whole thing is O(n) time and space. Certainly you can't do better than O(n) time. Whether you can do better than O(n) space or not while maintaining O(n) time I cannot say.

Related

How to find the row in a matrix which is closest to a test row?

I am currently trying to come up with a smart way to store my data, such that I can easily search, sort and insert.
My data consist of a std::pair<vector<int>,string> which is stored in a std::vector, forming sort of an 2d matrix.
Something like this:
Each vector has a number sequence that matches a string.
My problem here is when it should be tested.
A test, consist of testing it with a test vector, which might be the same length or be smaller than those stored in the matrix. how do I find out which string the test vector best matches with?
How do i search for that efficiently?
Some thought of ideas on how to solve the problem, and some of the problems with them:
One idea was to make sub sums, depending on the length of the test vector, and then see which of the sub sums, best matches the sum of the test vector.
Problem: I am looking to the same pattern, so the same sum could occur given another pattern.
Another idea was to make a copy of the matrix, sort it column wise, and make a search for a each index, and keep track of which the string were matching the best..
Problem: This though requires sorting all the columns - the matrix is way larger than showed above, it has around 1000 columns, and it seems too expensive to make one search for a one sort - given the amount of time I would spent on it - A insert possibility also need to be implemented, so something efficient would be appreciated.

Hash that returns the same value for all numbers in range?

I'm working on a problem where I have an entire table from a database in memory at all times, with a low range and high range of 9-digit numbers. I'm given a 9-digit number that I need to use to lookup the rest of the columns in the table based on whether that number falls in the range. For example, if the range was 100,000,000 to 125,000,000 and I was given a number 117,123,456, then I would know that I'm in the 100-125 mil range, and whatever vector of data that points to is what I will be using.
Now the best I can think of for lookup time is log(n) run time. This is OK, at best, but still pretty slow. The table has at least 100,000 entries and I will need to look up values in this table tens-of-thousands, if not hundred-thousands of times, per execution of this application (10+ times/day).
So I was wondering if it was possible to use an unordered_set instead, writing my own Hash function that ALWAYS returns the same hash-value for every number in range. Using the same example above, 100,000,000 through 125,000,000 will always return, for example, a hash value of AB12CD. Then when I use the lookup value of 117,123,456, I will get that same AB12CD hash and have a lookup time of O(1).
Is this possible, and if so, any ideas how?
Thanks in advance.
Yes. Assuming that you can number your intervals in order, you could fit a polynomial to your cutoff values, and receive an index value from the polynomial. For instance, with cutoffs of 100,000,000, 125,000,000, 250,000,000, and 327,000,000, you could use points (100, 0), (125, 1), (250, 2), and (327, 3), restricting the first derivative to [0, 1]. Assuming that you have decently-behaved intervals, you'll be able to fit this with an (N+2)th-degree polynomial for N cutoffs.
Have a table of desired hash values; use floor[polynomial(i)] for the index into the table.
Can you write such a hash function? Yes. Will evaluating it be slower than a search? Well there's the catch...
I would personally solve this problem as follows. I'd have a sorted vector of all values. And then I'd have a jump table of indexes into that vector based on the value of n >> 8.
So now your logic is that you look in the jump table to figure out where you are jumping to and how many values you should consider. (Just look at where you land versus the next index to see the size of the range.) If the whole range goes to the same vector, you're done. If there are only a few entries, do a linear search to find where you belong. If they are a lot of entries, do a binary search. Experiment with your data to find when binary search beats a linear search.
A vague memory suggests that the tradeoff is around 100 or so because predicting a branch wrong is expensive. But that is a vague memory from many years ago, so run the experiment for yourself.

Fastest way to copy into an array random elements from another array C++

The title almost tells everything,but I will exemplify this: suppose that you have an array a of chars, and another array b also of chars. Is there a better way to put in a only the char located at prime positions in b? Suppose that we have an array with prime positions.
For now my naive code looks like this.
for(i = 0; i < n; i++)
a[i] = b[j + prime[i]];
Here prime[i] stores the prime positions of b and b is much larger than a,j is an arbitrary position in b(there will not be an out of bound problem because j+prime[i] does not exceed border of b).
What is better? One way is: If the prime[] locations are known at compile time, then we could add a prefetch to get the cache lines in ahead of time.
This is making the memory access time better.
You can either do this when you read (or copy) values into the array, using a prime function that tells you if a number is prime or not.
A way I sketched quickly is to generate prime numbers until they reach your array capacity and simply iterate through them and copy the desired elements from your a array. I can think of several ways of optimizing this, such as having a "preprocess" function that generates prime numbers in your program so you can reuse the list.
The prime number list will get cached and it will take a lot less time to be accessed(it s unlikely that you have an extremely huge prime number list)
Let's look at this from an algorithmic perspective.
You want to perform a hash function on each of the entries in array A. Assuming that you know nothing about the state of the items in array A, then that places the lower bound of run time for the algorithm at O(n), linear time. You must iterate through every member because you don't have any more information that could assist you in "skipping" some elements or optimizing the process.
That said, the challenge then becomes keeping the algorithm down at O(n). The code you demonstrate does do this, assuming you then follow up with copying the non-prime numbers in the same manner. So for the copying step, no there is not a way to make this any faster from an algorithm point of view. That doesn't mean that how you perform the hashing step won't affect the speed, though.

Can we know if a collection is almost sorted without applying a sort algorithm?

In the wikipedia article on sorting algorithms,
http://en.wikipedia.org/wiki/Sorting_algorithm#Summaries_of_popular_sorting_algorithms
under Bubble sort it says:Bubble sort can also be used efficiently on a list of any length that is nearly sorted (that is, the elements are not significantly out of place)
So my question is: Without sorting the list using a sorting algoithm first, how can one know if that is nearly sorted or not?
Are you familiar with the general sorting lower bound? You can prove that in a comparison-based sorting algorithm, any sorting algorithm must make Ω(n log n) comparisons in the average case. The way you prove this is through an information-theoretic argument. The basic idea is that there are n! possible permutations of the input array, and since the only way you can learn about which permutation you got is to make comparisons, you have to make at least lg n! comparisons in order to be certain that you know the structure of your input permutation.
I haven't worked out the math on this, but I suspect that you could make similar arguments to show that it's difficult to learn how sorted a particular array is. Essentially, if you don't do a large number of comparisons, then you wouldn't be able to tell apart an array that's mostly sorted from an array that is actually quite far from sorted. As a result, all the algorithms I'm aware of that measure "sortedness" take a decent amount of time to do so.
For example, one measure of the level of "sortedness" in an array is the number of inversions in that array. You can count the number of inversions in an array in time O(n log n) using a divide-and-conquer algorithm based on mergesort, but with that runtime you could just sort the array instead.
Typically, the way that you'd know that your array was mostly sorted was to know something a priori about how it was generated. For example, if you're looking at temperature data gathered from 8AM - 12PM, it's very likely that the data is already mostly sorted (modulo some variance in the quality of the sensor readings). If your data looks at a stock price over time, it's also likely to be mostly sorted unless the company has a really wonky trajectory. Some other algorithms also partially sort arrays; for example, it's not uncommon for quicksort implementations to stop sorting when the size of the array left to sort is small and to follow everything up with a final insertion sort pass, since every element won't be very far from its final position then.
I don't believe there exists any standardized measure of how sorted or random an array is.
You can come up with your own measure - like count the number of adjacent pairs which are out of order (suggested in comment), or count the number of larger numbers which occur before smaller numbers in the array (this is trickier than a simple single pass).

How to efficiently *nearly* sort a list?

I have a list of items; I want to sort them, but I want a small element of randomness so they are not strictly in order, only on average ordered.
How can I do this most efficiently?
I don't mind if the quality of the random is not especially good, e.g. it simply based on the chance ordering of the input, e.g. an early-terminated incomplete sort.
The context is implementing a nearly-greedy search by introducing a very slight element of inexactness; this is in a tight loop and so the speed of sorting and calling random() are to be considered
My current code is to do a std::sort (this being C++) and then do a very short shuffle just in the early part of the array:
for(int i=0; i<3; i++) // I know I have more than 6 elements
std::swap(order[i],order[i+rand()%3]);
Use first two passes of JSort. Build heap twice, but do not perform insertion sort. If element of randomness is not small enough, repeat.
There is an approach that (unlike incomplete JSort) allows finer control over the resulting randomness and has time complexity dependent on randomness (the more random result is needed, the less time complexity). Use heapsort with Soft heap. For detailed description of the soft heap, see pdf 1 or pdf 2.
You could use a standard sort algorithm (is a standard library available?) and pass a predicate that "knows", given two elements, which is less than the other, or if they are equal (returning -1, 0 or 1). In the predicate then introduce a rare (configurable) case where the answer is random, by using a random number:
pseudocode:
if random(1000) == 0 then
return = random(2)-1 <-- -1,0,-1 randomly choosen
Here we have 1/1000 chances to "scamble" two elements, but that number strictly depends on the size of your container to sort.
Another thing to add in the 1000 case, could be to remove the "right" answer because that would not scramble the result!
Edit:
if random(100 * container_size) == 0 then <-- here I consider the container size
{
if element_1 < element_2
return random(1); <-- do not return the "correct" value of -1
else if element_1 > element_2
return random(1)-1; <-- do not return the "correct" value of 1
else
return random(1)==0 ? -1 : 1; <-- do not return 0
}
in my pseudocode:
random(x) = y where 0 <= y <=x
One possibility that requires a bit more space but would guarantee that existing sort algorithms could be used without modification would be to create a copy of the sort value(s) and then modify those in some fashion prior to sorting (and then use the modified value(s) for the sort).
For example, if the data to be sorted is a simple character field Name[N] then add a field (assuming data is in a structure or class) called NameMod[N]. Fill in the NameMod with a copy of Name but add some randomization. Then 3% of the time (or some appropriate amount) change the first character of the name (e.g., change it by +/- one or two characters). And then 10% of the time change the second character +/- a few characters.
Then run it through whatever sort algorithm you prefer. The benefit is that you could easily change those percentages and randomness. And the sort algorithm will still work (e.g., it would not have problems with the compare function returning inconsistent results).
If you are sure that element is at most k far away from where they should be, you can reduce quicksort N log(N) sorting time complexity down to N log(k)....
edit
More specifically, you would create k buckets, each containing N/k elements.
You can do quick sort for each bucket, which takes k * log(k) times, and then sort N/k buckets, which takes N/k log(N/k) time. Multiplying these two, you can do sorting in N log(max(N/k,k))
This can be useful because you can run sorting for each bucket in parallel, reducing total running time.
This works if you are sure that any element in the list is at most k indices away from their correct position after the sorting.
but I do not think you meant any restriction.
Split the list into two equally-sized parts. Sort each part separately, using any usual algorithm. Then merge these parts. Perform some merge iterations as usual, comparing merged elements. For other merge iterations, do not compare the elements, but instead select element from the same part, as in the previous step. It is not necessary to use RNG to decide, how to treat each element. Just ignore sorting order for every N-th element.
Other variant of this approach nearly sorts an array nearly in-place. Split the array into two parts with odd/even indexes. Sort them. (It is even possible to use standard C++ algorithm with appropriately modified iterator, like boost::permutation_iterator). Reserve some limited space at the end of the array. Merge parts, starting from the end. If merged part is going to overwrite one of the non-merged elements, just select this element. Otherwise select element in sorted order. Level of randomness is determined by the amount of reserved space.
Assuming you want the array sorted in ascending order, I would do the following:
for M iterations
pick a random index i
pick a random index k
if (i<k)!=(array[i]<array[k]) then swap(array[i],array[k])
M controls the "sortedness" of the array - as M increases the array becomes more and more sorted. I would say a reasonable value for M is n^2 where n is the length of the array. If it is too slow to pick random elements then you can precompute their indices beforehand. If the method is still too slow then you can always decrease M at the cost of getting a poorer sort.
Take a small random subset of the data and sort it. You can use this as a map to provide an estimate of where every element should appear in the final nearly-sorted list. You can scan through the full list now and move/swap elements that are not in a good position.
This is basically O(n), assuming the small initial sorting of the subset doesn't take a long time. Hopefully you can build the map such that the estimate can be extracted quickly.
Bubblesort to the rescue!
For a unsorted array, you could pick a few random elements and bubble them up or down. (maybe by rotation, which is a bit more efficient) It will be hard to control the amount of (dis)order, even if you pick all N elements, you are not sure that the whole array will be sorted, because elements are moved and you cannot ensure that you touched every element only once.
BTW: this kind of problem tends to occur in game playing engines, where the list with candidate moves is kept more-or-less sorted (because of weighted sampling), and sorting after each iteration is too expensive, and only one or a few elements are expected to move.