How to find number of different elements in sorted array? - c++

How do I find the number of different elements in a sorted array in O(1)?
I'm using the C++ multimap container (STL).
I mean O(1) exactly.

It's not really possible to know in constant time how many unique items are in a collection, even if the collection is sorted, unless you only allow one instance of each item in the collection at any given time. If you do restrict the collection to only having unique items, then the answer is trivial: it's the number of items in the collection, since they all have to be distinct.
In the case where you have an ordered collection of non-distinct items, you can find the number of distinct items by iterating through the collection and finding the state changes (when the current value isn't the same as the previous one). The number of distinct items is one more than the number of state changes (or the number of state changes if you start with the "empty" state and count the first item as a change from that).
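As a minimal sketch of that linear scan (the function name countDistinctSorted is mine, not from the question), counting the "state changes" over any sorted range looks roughly like this:

#include <cstddef>
#include <iterator>

// Count distinct values in a sorted range by counting the positions where
// the current element differs from the previous one.
template <typename It>
std::size_t countDistinctSorted(It first, It last) {
    if (first == last) return 0;
    std::size_t distinct = 1;              // the first element starts the first run
    for (It prev = first, cur = std::next(first); cur != last; ++prev, ++cur)
        if (*cur != *prev) ++distinct;     // a state change marks a new distinct value
    return distinct;
}

For a sorted std::vector<int> holding {1, 1, 2, 3, 3, 3} this returns 3. For a std::multimap you would compare it->first (the keys) instead, or skip ahead with upper_bound(it->first), since the elements are key/value pairs.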
You can also augment your data structure and add/delete algorithms to track the number of distinct items in the collection, so that you can "find" this number in constant time by simply querying a value that is updated during add/delete. This shouldn't affect the efficiency of either operation, since you only need to determine, on add, whether the new item is the first of its type (by checking whether the prev/next item has the same key) and, on delete, whether the removed item is the last of its type, by the same check.
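A sketch of that augmentation, assuming a std::multimap underneath (the class and member names are illustrative, not a standard interface):

#include <cstddef>
#include <map>

// Wraps a std::multimap and keeps the number of distinct keys up to date on
// every insert/erase, so the query itself is O(1). The O(log n) find calls
// don't change the asymptotic cost of insert or erase.
template <typename Key, typename Value>
class CountingMultimap {
public:
    void insert(const Key& k, const Value& v) {
        if (data_.find(k) == data_.end()) ++distinct_;   // first item of its type
        data_.emplace(k, v);
    }

    void erase(typename std::multimap<Key, Value>::iterator it) {
        Key k = it->first;
        data_.erase(it);
        if (data_.find(k) == data_.end()) --distinct_;   // last item of its type removed
    }

    std::size_t distinctKeys() const { return distinct_; }   // the O(1) query

private:
    std::multimap<Key, Value> data_;
    std::size_t distinct_ = 0;
};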
Let's consider a simple illustration.
Let's say you have a magic bag containing several different colored blocks numbered from 1 to N. The bag is magic because whenever you reach into the bag you can either determine how many blocks are in the bag (the value of N), OR look at a block with the guarantee that each time you reach into the bag you get the next block in color order (all the reds, then all the greens, etc., until no more are left), OR examine any single block by its number. What you want is to find out how many different colors of blocks are in the bag by reaching into the bag some fixed number of times.
Now, getting the number of total blocks in the bag takes one reach but does you no good because you want to know the number of different colors. Getting any fixed number of randomly selected blocks (less than N) takes a fixed number of reaches but does you no good because they don't tell you anything about the rest of the blocks in the bag. The only thing you can do is pull all of the blocks out one by one in order and find the number of times that the next block is a different color from the last one.
Now if you allow me to change how I put blocks into or take them out of the bag, I could keep track of how many colors of blocks are in the bag as I go, then it again becomes trivial to tell you. I just give you the value that I'm keeping track of. Essentially I'm trading a small amount of space (the place where I keep track of the value) and a bit extra time during add/delete for a larger amount of time trying to find the number of distinct colors later. You just need to decide if the trade-off is worth it.

Related

Best data structure for random (indexed) access/removal of pre-sorted data

The data set consists of increasing values. Data is always appended, and thus is inherently sorted. There is no need for random insertions in the middle or beginning of the set, however, random deletions will occur.
Operations consist of:
Finding the nth value,
Finding the position N for a given value
Appending a sorted value at the end
Random removal of a given value or index
Value iteration/locality is not at all important
For similar questions, people have proposed using a map from value to index as a solution; however, in this case the indexes themselves must be sorted, e.g. arr[N] < arr[N+x]. The array+map solution does not work here.
The sequence itself needs to be stored on disk and I have a key-value database (lmdb) at my disposal. Having something which is serialized/deserialized in its entirety is also something to be avoided, as it is usually opened and closed fairly quickly, and deserializing the set in its entirety would potentially evict precious pages from elsewhere.
The set will contain ~100k items (and several thousand of these sets may be open simultaneously).
The closest in-memory structure is probably a skiplist, but I've yet to find an implementation that supports finding the nth element, and I'm also unsure how it would be loaded/stored on disk.
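For the in-memory side of these operations, an order-statistics tree is the usual fit; as one hedged illustration, GCC's pb_ds policy-tree extension (non-standard, GCC-specific) supports both "find the nth value" and "find the position of a value", though it does not address the lmdb/on-disk requirement:

#include <cstddef>
#include <functional>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

// An order-statistics set: find_by_order(n) returns an iterator to the nth
// smallest element, order_of_key(v) returns how many elements are less than v.
using ordered_set = __gnu_pbds::tree<long long, __gnu_pbds::null_type,
                                     std::less<long long>, __gnu_pbds::rb_tree_tag,
                                     __gnu_pbds::tree_order_statistics_node_update>;

int main() {
    ordered_set s;
    s.insert(10); s.insert(20); s.insert(35);   // appending increasing values
    auto nth = s.find_by_order(1);              // iterator to 20 (the 2nd value)
    std::size_t pos = s.order_of_key(35);       // 2: index of the value 35
    s.erase(20);                                // random removal by value
    (void)nth; (void)pos;
}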

How to track changes to a list

I have an immutable base list of items that I want to perform a number of operations on: edit, add, delete, read. The actual operations will be queued up and performed elsewhere (sent up to the server and a new list will be sent down), but I want a representation of what the list would look like with the current set of operations applied to the base list.
My current implementation keeps a vector of ranges and where they map to. So an unedited list has one range, from 0 to length, that maps directly to the base list. If an add is performed at index 5, then we have 3 ranges: 0-4 maps to base list 0-4, 5 maps to the new item, and 6-(length+1) maps to 5-length. This works; however, with a lot of adds and deletes, reads degrade to O(n).
I've thought of using hashmaps but the shifts in ranges that can occur with inserts and deletes presents a challenge. Is there some way to achieve this so that reads are around O(1) still?
If you had a roughly balanced tree of ranges, where each node kept a record of the total number of elements below it in the tree, you could answer reads in time proportional to the depth of the tree in the worst case, which should be about log(n). Perhaps a treap (https://en.wikipedia.org/wiki/Treap) would be one of the easier balanced trees to implement for this.
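A sketch of that indexed read over a size-augmented tree of ranges, assuming each node is either a slice of the immutable base list or a single added item; balancing (treap priorities and rotations) is deliberately left out, and the names (including the stand-in Item type) are mine:

#include <cstddef>
#include <memory>

struct Item { int value; };                    // stand-in for the real item type

// A node covers either a contiguous slice of the base list or one added item,
// and caches the total number of elements in its subtree.
struct RangeNode {
    bool fromBase;                             // true: maps into the base list
    std::size_t baseStart;                     // first base index covered (if fromBase)
    std::size_t length;                        // elements covered by this node itself
    Item added;                                // the new item (if !fromBase)
    std::size_t subtreeCount;                  // length + elements in both children
    std::unique_ptr<RangeNode> left, right;
};

// Resolve logical index i (0 <= i < root->subtreeCount) to the node that
// holds it and the offset within that node, descending one level per step.
const RangeNode* resolve(const RangeNode* node, std::size_t i, std::size_t& offset) {
    std::size_t leftCount = node->left ? node->left->subtreeCount : 0;
    if (i < leftCount)
        return resolve(node->left.get(), i, offset);
    i -= leftCount;
    if (i < node->length) { offset = i; return node; }   // falls inside this node
    return resolve(node->right.get(), i - node->length, offset);
}

The caller then reads either base[node->baseStart + offset] or node->added, so each read costs one descent of the tree rather than a linear scan of the ranges.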
If you had a lot of repetitious reads and few modifications, you might gain by keeping a hashmap of the results of reads since the last modification, clearing it on modification.

Implementing a mutable ranking table in c++

In an event-driven simulator, I need to keep track of the popularity of a large number of content elements from a catalog. More specifically, I am interested in knowing the rank of any given content, i.e. its position in a list sorted by descending number of requests. I know that the number of requests per content is only going to be increased by one each time, so there is no dramatic shift in the ranking. Furthermore, elements are inserted into or deleted from the catalog only on rare occasions, while requests are much more numerous and frequent. What is the best data structure to implement this?
These are the options that I have considered:
a std::map<ContentElement, unsigned int> mapping contents to the number of requests they received. Not a good choice, as it requires me to dump everything to a list and sort it whenever I want to know the ranking of a content, which is very often.
a boost::multi_index_container with two indexes, a hashed_unique one for the ContentElement and an ordered_non_unique one for the number of requests. This allows me to quickly retrieve a content in order to update its number of requests and to keep the container sorted as I do this through a modify call, but my understanding of the ordered index is that it still forces me to iterate through all its elements in order to figure out the rank of a content - I could not find a simple way of extracting the position in the ranking from the ordered iterator.
a boost::bimap between content elements and ranking positions, supported by an external sorted vector storing the number of requests per content. Essentially, the rank of a content would also be the index of the vector element holding its number of requests. This allows me to do everything I want (e.g., easily go from content to rank and vice versa), and re-sorting the vector after a new request should require at most two swaps in the bimap. However, it feels clumsy and error-prone, as I could easily lose sync between the map and the vector, and then everything would fall apart.
My gut tells me there must be a much simpler and more elegant way of handling this, but I could not find it. Can anyone help?
There is no need to do a full sort. The key insight here is that a ranking can only change by +1 or -1 when it is accessed. I would do the following...
Keep the elements in a container of your choice, e.g.
map< elementId, elementInstance >
Maintain a linked list of element rankings, something like this:
list< rankingInstance >
The rankingInstance holds a pointer to an elementInstance, plus the current rank and the current number of accesses. On access, you (see the sketch after this list):
access the element in the map
get its current rank, and access count
update the count
using the current rank, access the linked list
check the neighbors
swap position in list if necessary
if swapping occurred, go back and update the two elements whose rank changed
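A minimal sketch of those steps, assuming a std::list for the rankings and a std::map from content to its list entry (the names ContentId, onRequest, etc. are illustrative, not from the question):

#include <cstddef>
#include <iterator>
#include <list>
#include <map>
#include <string>
#include <utility>

using ContentId = std::string;

struct RankingEntry {
    ContentId id;
    unsigned int requests = 0;
    std::size_t rank = 0;                      // 0 == most requested
};

std::list<RankingEntry> ranking;                                   // sorted by requests, descending
std::map<ContentId, std::list<RankingEntry>::iterator> byContent;  // content -> its ranking entry

void onRequest(const ContentId& id) {
    auto it = byContent.at(id);                // access the element's entry
    ++it->requests;                            // update the count
    // Check the neighbour above and swap while this entry now outranks it;
    // with distinct counts this is at most a single swap.
    while (it != ranking.begin()) {
        auto prev = std::prev(it);
        if (prev->requests >= it->requests) break;
        std::swap(prev->rank, it->rank);       // the two affected ranks trade places
        ranking.splice(prev, ranking, it);     // move *it in front of *prev
    }
}

Looking up a content's rank is then just byContent.at(id)->rank, and inserting a new content appends an entry whose rank is the current list size.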
It may seem too simple, but my suggestion is to use bubble sort on your list, since bubble sort compares and swaps only adjacent elements, which matches your case of a single one-up or one-down move in the ranking. Your vector could keep the rank as the key and a content hash as the value, and a map containing the content (or a reference to the content) will also be needed. I hope this very simple approach gives some insight into your problem.

Getting info from multiple maps in c++ and comparing

I'm trying to do something a bit weird with a bunch of maps in c++. (If I call them dictionaries, I apologize. Used to working in Python 2.7.)
I have fifty maps, each one representing the top five growing cities in a state. The key is the city name, the mapped value its rate of growth (e.g., Alaska: {anchorage: 112.55, boonies: 106.22, ...}, the five fastest growing cities in that state). What I want to do is compare all of these mapped values, find the five highest, and print out the names of the cities they belong to.
In addition to the fact that I'm comparing 250 numbers to start with, I don't know the mapped values or keys until the program runs. It's taking this data from a massive txt file of over 19,000 lines, finding the five highest per state, and then creating a dictionary for each state.
I have no idea on how to access all of these values without knowing what their keys are, or even how to compare all of these numbers at once. I need more help with the former than the latter, but anything will be helpful.
EDIT:
Attempting to answer all the questions asked:
How I got the fifty maps is from a custom class I wrote. The txt file with the data is too large to be read all at once, so I wrote a class that will read a certain section of the file, find the names and values of the top five in that section, and return them as a map with five pairs: the name of the city and its value. With each state being a section, it will return fifty maps, each containing five pairs. So the map for Alaska will have {city: growth value, city2: growth value2, ...}, then there will be a map for Arizona, a map for Alabama, etc. At the moment, I'm calling the function 50 different times with different sections to read. Each time I call the function, it returns the map of the top five pairs. I hope that's clearer.
I need the data as a map because I need the name to print out at the end, in addition to the growth value to compare and find the best. However, now that I'm thinking about it more, I'm considering adding a function to the class that keeps a vector of all the city names and one of just the growth values. I could take the rate vector and compare, then find each value's corresponding name in the other vector with some kind of counter variable. Not sure if that would work any better or be any easier, but it's a possibility that comes to mind.
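As a sketch of one way to do the comparison without knowing the keys in advance (the variable names and the sample data are illustrative), a range-for over each map yields every (city, growth) pair, and std::partial_sort pulls the five largest to the front:

#include <algorithm>
#include <cstddef>
#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
    // One map per state; in the real program these come from the parsing class.
    std::vector<std::map<std::string, double>> states = {
        {{"anchorage", 112.55}, {"boonies", 106.22}},
        {{"phoenix", 120.10}, {"tucson", 101.30}},
        // ... 50 maps in total
    };

    std::vector<std::pair<double, std::string>> all;         // (growth, city)
    for (const auto& state : states)
        for (const auto& [city, growth] : state)             // keys discovered at run time
            all.emplace_back(growth, city);

    std::size_t top = std::min<std::size_t>(5, all.size());
    std::partial_sort(all.begin(), all.begin() + top, all.end(),
                      std::greater<>());                      // five largest first

    for (std::size_t i = 0; i < top; ++i)
        std::cout << all[i].second << ": " << all[i].first << '\n';
}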

Determining the best ADT for a priority queue with changeable elements (C++)

First post here and I'm a beginner - hope I'm making myself useful...
I'm trying to find and understand the ADT/concept that does the job I'm after. I'm guessing it's already out there.
I have an array/list/tree (container to be decided) of objects, each of which has a count associated with how much it hasn't been used over iterations of a process. As iterations proceed, the count for each object accumulates by 1. The idea is that sooner or later I'm going to need the memory that any unused objects are using, so I'll delete them to make space for an object not in RAM (which will have an initial count of '0'). But if it turns out that I use an object that is still in memory, its count is reset to '0', and I pat myself on the back for not having had to access the disk for its contents.
A cache?
The main process loop would have something similar to the following in it:
if (object needs to be added && totalNumberOfObjects > someConstant)
    delete the object with the highest count from RAM (and the heap??)
    add newObject with a count of '0'
if (an object already in RAM is accessed by the process)
    set accessedObject's count to '0'
for (all objects in RAM)
    count++
I could bash about for a (long and buggy) time and build my own mess, but I thought it'd be interesting to learn the most efficient way from the word go.
Something like a heap?
You could use a heap for this, but I think it would be overkill. It sounds like you're not going to have a lot of different values for the counts, and you'll have a lot of objects with each count. If that's true, then you only need to thread the objects onto a list of objects with the same count. These lists are themselves arranged in a dequeue (or 'deque' as C++ insists on calling it).
The key here is that you need to increment the count of all objects, and presumably you want that to be O(1) if possible, rather than O(N). And it is possible: each list's header also contains the difference between its count and the next smaller count. The header of the list with the smallest count contains a delta from 0, i.e. the smallest count itself. To increment the count of all objects, you only have to increase this single number by one.
To set an object's count to 0, you remove the object from its list (which means you always need to refer to objects by their list iterator, or you need to implement your own intrusive linked list), and either (1) add it to the bottom list, if that list has a count of 0, or (2) create a new bottom list with a count of 0 containing only that object.
The procedure for creating a new object is the same, except that you don't have to unlink it from its current list.
To evict an object from memory, you choose the object at the head of the top list (which is the list with the largest count). If that list becomes empty, you pop it off the dequeue. If you need more memory, you can repeat this operation.
So all operations, including "increment all counts", are O(1). Unfortunately, the storage overhead is two pointers per object, plus two pointers and an integer per unique count (at worst, this is the same as the number of objects, but presumably in practice it's much less). Since it's hard to imagine any other algorithm which uses less than one pointer plus a count for each object, this is probably not even a space-time tradeoff; the additional space requirements are minimal.
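A sketch of that layout, under the assumption that objects are referred to by an id (the class and member names are mine); resetting an existing object's count is only noted in a comment because it needs the object's stored list iterator:

#include <cstdint>
#include <deque>
#include <list>
#include <string>

using ObjectId = std::string;

// One bucket per distinct count: it holds every object sharing that count and
// stores only the delta from the next smaller bucket's count.
struct Bucket {
    std::uint64_t delta = 0;       // this count minus the next smaller bucket's count
    std::list<ObjectId> objs;      // objects currently at this count
};

class UnusedCounter {
public:
    // "count++ for every object in RAM": bump the smallest bucket's delta, O(1).
    void tick() { if (!buckets_.empty()) buckets_.front().delta += 1; }

    // A new (or just-accessed) object has count 0, so it belongs in a front
    // bucket whose absolute count is 0. Resetting an existing object would
    // first unlink it from its old bucket via its stored iterator, folding
    // that bucket's delta into the next larger bucket if it becomes empty.
    void addAtZero(const ObjectId& id) {
        if (buckets_.empty() || buckets_.front().delta != 0)
            buckets_.push_front(Bucket{});
        buckets_.front().objs.push_back(id);
    }

    // Evict the object with the largest count: the head of the last bucket.
    ObjectId evict() {
        ObjectId victim = buckets_.back().objs.front();
        buckets_.back().objs.pop_front();
        if (buckets_.back().objs.empty())
            buckets_.pop_back();   // that count no longer exists
        return victim;
    }

private:
    std::deque<Bucket> buckets_;   // ordered from smallest to largest count
};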