Search structure with history (persistence) - c++

I need a map-like data structure (in C++) for storing pairs (Key,T) with the following functionality:
You can insert new elements (Key,T) into the current structure
You can search for elements based on Key in the current structure
You can make a "snapshot" of the current version of the structure
You can switch to one of the versions of the structures which you took the snapshot of and continue all operations from there
Completely remove one of the versions
What I don't need
Element removal from the structure
Merging of different versions of the structure into one
Iteration over all (or some of) elements currently stored in the structure
In other words, you have some search structure that you can build up, but at any point you can jump in history, and expand the earlier/different version of the structure in a different way. Later on you may jump between those different versions.
In my project, Key and T are likely to be integers or pointer values, but not strings.
The primary objective is to reduce the time complexity; space consumption is secondary (but should be reasonable as well). To clarify, for me log(N)+log(S) (where N-number of elements, S-number of snapshots) would be enough, although faster is better :)
I have some rough idea how to implement it --- for example: being the structure a binary search tree, the insertion of a new element can clone the path from the root to the insertion location, while keeping the rest of the tree intact. Switching tree versions would be equivalent to picking a different version of the root node, for which some changes are simply not visible.
However, to make this custom tree efficient (e.g. self-balancing) it will require some additional effort and careful coding. Of course I can do it myself but perhaps there are already existing libraries to do exactly that?
Also, there is probably a proper name for this kind of data structure that I simply don't know, making my Google searches (or SO searches) total failures...
Thank you for your help!

I think what you are looking for is an immutable map. Functional (or functionally inspired) programming languages (such as Haskell or Scala) have immutable versions of most of the containers you'd find in the STL. Operations such as insertion/removal etc. then return a copy of the map (preserving the original) with the copy containing your requested modification. A lot of work has gone into designing the datastructures so that the copies are able to point to as much of the original datastructure as possible to reduce time and memory complexity of each operation.
You can find a lot more details in a book such as this one: http://www.amazon.co.uk/Purely-Functional-Structures-Chris-Okasaki/dp/0521663504.

While searching for some persistent search trees libraries I stumbled on this:
http://cg.scs.carleton.ca/~dana/pbst/
While it does not have the exact same functionality as needed, it seems pretty close to it. I will investigate.
(posting here, as someone may find it useful as well)

Related

Managing large spatial data set with attributes in C++

I have a data set with about 700 000 entries, and each entry is a set of 3D coordinates with attributes such as name, timestamp, ID, and so on.
Right now I'm just reading the coordinates and render them as points in OpenGL. However I want to associate each point with its corresponding attributes and I want to be able to sort and pick them during runtime based on their attributes. How would I go about to achieve this in an efficient manner?
I know I can put I can put the data in a struct and use stl sort for sorting, but is that a good design choice or is there a more efficient/elegant way of handling the problem?
The way I tend to look at these design choices is to first use one of the standard library containers (btw, if you need to "just" do lookup you don't necessarily have to sort, but you need a container that allows lookup), then check if this an "efficient enough" solution for the problem.
You can usually come up with a custom solution that is more efficient and maybe more elegant but you tend to run into two issues with that:
1) You end up having to implement some type of a container, which will cost you time both in implementation and debugging compared to a well understood and tested container that is already out there. Most of the time you're better off trying to solve the problem at hand rather than make it bigger by adding more code.
2) If someone else will have to maintain your code at some point, chances are they are familiar with standard library components both from a design and implementation perspective, but they won't be familiar with your custom container, thus increasing the learning curve.
If you consider each attribute of your point class as a component of a vector, then your selection process is a region query. Your example of a string attribute being equal to something means that the region is actually a line in your data space. However, there won't be any sorting made on other attributes within that selection, you will have to implement it by yourself, but it should be relatively straightforward for octrees, which partition data in ordered regions.
As advocated in another answer, try existing standard solutions first. If you can find an of the shelf implementation of one of these data structures:
R-tree
KD tree
BSP
Octree, or more likely, a n dimensional version of the quadtree or octree principle (I will use the term octree herein to denote the general data structure)
then go for it. These are the data structures I recommend for spatial data management.
You could also use an embedded RDBMS capable of working with spatial data (they usually implement R-tree for spatial indexing), but it may not be interesting if your dataset isn't dynamic.
If your dataset falls within the 10000 entries range, then by today standards it isn't that large, so using simpler structures should suffice. In that perimeter, I would go first for a simple std::vector, and use std::sort and std::find to filter the data in smaller set and sort it afterward.
I would probably try an ordered set or map on the most queried attribute in a second attempt, then do some benchmarks to pick the more performing solution.
For a more efficient one dimensional indexing algorithm (in essence, that`s what sets and maps are), you might want to try B-trees: there's C++ implementation available from google.
My third attempt would go toward an OpenCL solution (although if you are doing heavy OpenGL rendering, you might prefer doing the work on the CPU instead, but that depends on your framerate needs).
If your dataset is much larger, as it seems to be, then consider one of the more complex solutions I listed initially.
At any rate, without more details about your dataset and how you plan to use it, it will be difficult to provide a good solution, so the only real advice we can give is: try everthing you can and benchmark.
If you're dealing with point clouds, take a look at PCL, it could save you a lot of time and effort without having to dig into the intricacies of spatial indexing yourself. It also includes visualisation.

Boost flat_map container

Working on some legacy code, I am running into memory issues due mainly (I believe) to the extensive use of STL maps (particularly “maps-of-maps”.)
I am looking at Boost flat_map as a possible solution. Does anyone have any firsthand experience with flat_maps, in particular with regards improvements in speed and/or memory usage? I realize of course this can be very dependent on the types of data stored and the manner in which they are stored but still curious of folk’s actual experience.
Can anyone point me to some solid examples?
As an example: there are several cases in this code of a map-of-a-map; that is, a map where the value is another map.
By replacing the “inner” map with a pair of vectors, I reduced the memory footprint 10:1 (3G to 300M). Of course this can slow down searches but for this particular case it doesn’t seem to matter much. And it involved about a day of refactoring and careful testing.
Boost’s flat_map sounds like it might be just what I need but I can’t seem to find out much about it other than the class description on the Boost web site. Looking for some firsthand feedback.
Boost's flat_map is a binary-tree-based map implementation, except that that binary tree is stored as a (sorted) vector of key-value pairs.
You can basically figure out the answers regarding performance (relative to an std::map by yourself based on that fact:
Iterating the map or a large part of it should be super-fast, relatively
Lookup should typically be relatively fast
Adding or removing values is theoretically much slower, but in practice - assuming your key and value types are small and the number of map elements not very high - probably comparable in speed (or even better on small maps - often no allocation is necessary on insert)
etc.
In your case - maps-of-maps - you're going to lose some of the benefit of "flattening things out", since you'll have an outer map with a pointer to an inner map (if not more levels of indirection); but the flat map would at least help you reduce that. Also, supposing you have two levels of maps, you could arrange it so that you store all of the inner maps contiguously (either by constructing the inner maps appropriately or by instantiating them with your own allocator, a trickier affair); in that case, you could replace pointers to maps with map indices, reducing the amount of space they take up and making life easier for the compiler.
You might also want read Boost's documentation of flat_map; and you could also just use the force and read the source (and the source of the underlying flat_tree) - like I have; I dont actually have flat_map experience myself.
I know this is an old question, but this might be of use to someone finding this question.
I found that flat_map was a big improvement in searching, lookup and iterating large maps. The fact the map is using contiguous data in memory also makes inserting faster than you might expect due to great data locality. If you're doing more inserts than lookups in your map then it might not be for you.
Having said that, repeatedly inserting a random value into a sorted vector is faster than the same on a linked list because of the data locality - despite what Big O might tell you. (tested in VS2017 and G++ 4.8).

Efficient Dictionary lookup

For my C++ application, there is a requirement to check if a word is a valid English dictionary word or not. What is the best way to do it. Is there freely available dictionary that I can make use of. I just need a collection of all possible words. How to make this lookup least expensive. Do I need to hash it.
Use either a std::set<std::string> or a std::unordered_set<std::string>. The latter is new in C++0x and may or may not be supported by your C++ Standard Library implementation; if it does not support it, it may include a hash_set of some kind: consult your documentation to find out.
Which of these (set, which uses a binary search tree, and unordered_set, which uses a hashtable) is more efficient depends on the number of elements you are storing in the container and how your Standard Library implementation implements them. Your best bet is to try both and see which performs better for your specific scenario.
Alternatively, if the list of words is fixed, you might consider using a sorted std::vector and using std::binary_search to find words in it.
With regards to the presence of a word list, it depends on the platform.
Under Linux, /usr/share/dict/words contains a list of English words
that might meet your needs. Otherwise, there are doubtlessly such lists
available on the network.
Given the size of such lists, the most rapid access will be to load it
into a hash table. std::unsorted_set, if you have it; otherwise, many
C++ compilers come with a hash_set, although different compilers have
a slightly different interface for it, and put it in different
namespaces. If that still has performance problems, it's possible to do
better if you know the number of entries in advance (so the table never
has to grow), and implement the hash table in an std::vector (or even a
C style array); handling collisions will be a bit more complicated,
however.
Another possibility would be a trie. This will almost certainly result
in the least number of basic operations in the lookup, and is fairly
simple to implement. Typical implementations will have very poor
locality, however, which could make it slower than some of the other
solutions in actual practice (or not—the only way to know is to
implement both and measure).
I actually did this a few months ago, or something close to this. You can probably find one online for free.
Like on this website: http://wordlist.sourceforge.net/
Just put it in a text file, and compare words with what is on the list. It should be order n with n being the number of words in the list. Do you need the time complexity faster?
Hope this helps.

What data structures are commonly used for LRU caches and quickly locating objects?

I intended to implement a HashTable to locate objects quickly which is important for my application.
However, I don't like the idea of scanning and potentially having to lock the entire table in order to locate which object was last accessed. Tables could be quite large.
What data structures are commonly used to overcome that?
e.g. I thought I could throw objects into a FIFO as well as the cache in order to know how old something is. But that's not going to support an LRU algorithm.
Any ideas? how does squid do it?
Linked lists are good for LRU caches. For indexed lookups inside the linked list (to move the entry to the most recently used end of the linked list), use a HashTable. The least recently used entry will always be last in the linked list.
You might find this article on LRU cache implementation using STL containers (or a boost::bimap-based alternative) interesting. With STL, basically you use a combination of a map (for fast key-value lookup) and a separate list of keys or iterators into that map (for easy maintenance of access history).

Help me understand how the conflict between immutability and running time is handled in Clojure

Clojure truly piqued my interest, and I started going through a tutorial on it:
http://java.ociweb.com/mark/clojure/article.html
Consider these two lines mentioned under "Set":
(def stooges (hash-set "Moe" "Larry" "Curly")) ; not sorted
(def more-stooges (conj stooges "Shemp")) ; -> #{"Moe" "Larry" "Curly" "Shemp"}
My first thought was that the second operation should take constant time to complete; otherwise functional language might have little benefit over an object-oriented one. One can easily imagine a need to start with [nearly] empty set, and populate it and shrink it as we go along. So, instead of assigning the new result to more-stooges, we could re-assign it to itself.
Now, by the marvelous promise of functional languages, side effects are not to be concerned with. So, sets stooges and more-stooges should not work on top of each other ever. So, either the creation of more-stooges is a linear operation, or they share a common buffer (like Java's StringBuffer) which would seem like a very bad idea and conflict with immutability (subsequently stooges can drop an element one-by-one).
I am probably reinventing a wheel here. it seems like the hash-set would be more performant in clojure when you start with the maximum number of elements and then remove them one at a time until empty set as oppose to starting with an empty set and growing it one at a time.
The examples above might not seem terribly practical, or have workarounds, but the object-oriented language like Java/C#/Python/etc. has no problem with either growing or shrinking a set one or few elements at a time while also doing it fast.
A [functional] language which guarantees(or just promises?) immutability would not be able to grow a set as fast. Is there another idiom that one can use which somehow can help avoiding doing that?
For someone familiar with Python, I would mention set comprehension versus an equivalent loop approach. The running time of the two is tiny bit different, but that has to do with relative speeds of C, Python, interpreter and not rooted in complexity. The problem I see is that set comprehension is often a better approach, but NOT ALWAYS the best approach, for the readability might suffer a great deal.
Let me know if the question is not clear.
The core Immutable data structures are one of the most fascinating parts of the language for me also. their is a lot to answering this question and Rich does a really great job of it in this video:
http://blip.tv/file/707974
The core data structures:
are actually fully immutable
the old copies are also immutable
performance does not degrade for the old copies
access is constant (actually bounded <= a constant)
all support efficient appending, concatenating (except lists and seqs) and chopping
How do they do this???
the secret: it's pretty much all trees under the hood (actually a trie).
But what if i really want to edit somthing in place?
you can use clojure's transients to edit a structure in place and then produce a immutable version (in constant time) when you are ready to share it.
as a little background: a Trie is a tree where all the common elements of the key are hoisted up to the top of the tree. the sets and maps in clojure use trie where the indexes are a hash of the key you are looking for. it then breaks the hash up into small chunks and uses each chunk as the key to one level of the hash-trie. This allows the common parts of both the new and old maps to be shared and the access time is bounded because there can only be a fixed number of branches because the hash used as in input has a fixed size.
Using these hash tries also helps prevent big slowdowns during the re-balancing used by a lot of other persistent data structures. so you will actually get fairly constant wall-clock-access-time.
I really reccomend the (relativly short)_ book: Purely Functional Data Structures
In it he covers a lot of really interesting structures and concepts like "removing amortization" to allow true constant time access for queues. and things like lazy-persistent queues. the author even offers a free copy in pdf here
Clojure's data structures are persistent, which means that they are immutable but use structural sharing to support efficient "modifications". See the section on immutable data structures in the Clojure docs for a more thorough explanation. In particular, it states
Specifically, this means that the new version can't be created using a full copy, since that would require linear time. Inevitably, persistent collections are implemented using linked data structures, so that the new versions can share structure with the prior version.
These posts, as well as some of Rich Hickey's talks, give a good overview of the implementation of persistent data structures.