OCaml extremely large data structure suggestions - list

I am looking for suggestions on what kind of data-structure to use for extremely large structures in OCaml that scale well.
By "scales well" I mean that I don't want stack overflows or exponential heap growth, assuming there is enough memory. That pretty much eliminates the standard library's List.map, which is not tail-recursive and blows the stack on long lists. Speed isn't so much an issue.
But for starters, let's assume I'm operating in the realm of 2^10 - 2^100 items.
There are only three "manipulations" I perform on the structure:
(1) a map function on subsets of the structure, which either grows or shrinks the structure
(2) scanning the structure
(3) removal of specific pairs of items in the structure that satisfy a particular criterion
Originally I was using regular lists, which are still highly desirable because the structure is constantly changing. Usually, after all manipulations are performed, the structure has either roughly doubled in size at most, or been reduced to the empty list []. Perhaps the doubling dooms me from the beginning, but it is unavoidable.
In any event, around 2^15 - 2^40 items severe problems start to appear (probably also due to the naive list functions I was using). The program uses 100% of the CPU but almost no memory, and generally after a day or two it overflows the stack.
I would prefer to start using more memory, if possible, in order to continue operating in larger spaces.
Anyway, if anyone has any suggestions it would be much appreciated.

If you have enough space, in theory, to hold all items of your data structure in memory, you should look at data structures with an efficient memory representation and as little bookkeeping as possible. Dynamic arrays (which you resize exponentially when you need more space) are stored more compactly than lists (which pay a full extra word per cell to store the tail pointer), so you'd fit roughly twice as many elements in the same amount of memory.
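For concreteness, here is a minimal OCaml sketch of such a dynamic array (the module and function names are mine, and newer OCaml releases also ship a built-in Dynarray module that does this for you): the backing array doubles when full, so pushes are amortized O(1) and the only per-element cost is the slot itself:
(* Minimal growable array: doubles capacity when full. *)
module Dynarray = struct
  type 'a t = { mutable data : 'a array; mutable len : int }

  let create () = { data = [||]; len = 0 }

  let push t x =
    if t.len = Array.length t.data then begin
      (* Exponential growth keeps the amortized cost of push constant. *)
      let cap = max 8 (2 * Array.length t.data) in
      let data = Array.make cap x in
      Array.blit t.data 0 data 0 t.len;
      t.data <- data
    end;
    t.data.(t.len) <- x;
    t.len <- t.len + 1

  (* Scanning (manipulation 2) is a plain loop over the used prefix. *)
  let iter f t =
    for i = 0 to t.len - 1 do f t.data.(i) done
end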
If you cannot hold all elements in memory (and that is what your numbers look like), you should go for a more abstract representation. It's difficult to say more without knowing what your elements are, but maybe an example of an abstract representation will help you devise what you need.
Imagine that I want to record sets of integers. I want to take unions and intersections of those sets, plus some more funky operations such as "keep only the elements that are multiples of a given number". I want to be able to do that for really large sets (zillions of distinct integers), and then I want to be able to pick one element, any one, from the set I have built. Instead of storing lists of integers, or sets of integers, or arrays of booleans, I can store the logical formulas corresponding to the definitions of those sets: a set of integers P is characterized by a formula F such that F(n) ⇔ n∈P. I can therefore define a type of predicates (conditions):
type predicate =
  | Segment of int * int            (* n ∈ [a;b] *)
  | Inter of predicate * predicate
  | Union of predicate * predicate
  | Multiple of int                 (* n mod a = 0 *)
Storing these formulas requires little memory (proportional to the number of operations I want to apply in total). Building an intersection or a union takes constant time. Then I'll have some work to do to find an element satisfying the formula; basically I'll have to reason about what those formulas mean, get a normal form out of them (they all describe a finite union of intervals satisfying some modulo criteria), and from there extract some element.
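For instance, the "F(n) ⇔ n∈P" reading gives a membership test that is just a recursive walk of the formula, without ever materializing the set:
(* Does n satisfy the predicate, i.e. does n belong to the set it denotes? *)
let rec mem n = function
  | Segment (a, b) -> a <= n && n <= b
  | Inter (p, q) -> mem n p && mem n q
  | Union (p, q) -> mem n p || mem n q
  | Multiple a -> n mod a = 0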
In the general case, when you get a "command" on your data set, such as "add the result of mapping over this subset", you can always store the command itself as data (the definition of your structure) instead of actually evaluating it. The more precisely you can describe those commands (e.g. you say "map", but storing an arbitrary (elem -> elem) function will not let you reason easily about the result; maybe you can express that mapping as a concrete combination of named operations), the more precisely you will be able to work on them at this abstract level, without actually computing the elements.
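As a rough sketch of what "storing commands as data" can look like (the constructor names here are purely illustrative, not a prescribed design): applying a command is O(1), and the work is deferred to whoever consumes the value:
(* A structure described by the commands that built it rather than by its elements. *)
type 'a structure =
  | Elements of 'a list                     (* concrete base case *)
  | Mapped of ('a -> 'a) * 'a structure     (* opaque functions are hard to reason about;
                                               prefer a concrete set of named operations *)
  | Filtered of ('a -> bool) * 'a structure
  | Concat of 'a structure * 'a structure

(* Scanning walks the description without ever building the full list. *)
let rec iter f = function
  | Elements xs -> List.iter f xs
  | Mapped (g, s) -> iter (fun x -> f (g x)) s
  | Filtered (p, s) -> iter (fun x -> if p x then f x) s
  | Concat (a, b) -> iter f a; iter f b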

Related

Tree-like data structure in C++ optimal for compact storage only

There are tons of questions about tree-like data structures, also in C++. Here is a very specific one that I need:
I have to store data consisting of assignments of natural numbers to finite sequences of natural numbers. So a typical entry would be something like (5,11,6,2,4,8,8,3,1) -> 788569. My only priority is to store these as compactly as possible. So I would like a program which does the following: it looks up x=f(5), then y=x(11), then z=y(6), then t=z(2), then u=t(4), then v=u(8), then w=v(8), then s=w(3), then r=s(1); if r(0) is defined, it must return 788569; if it is undefined, another program should then compute this value according to an algorithm supplied by me, make r(0) defined, and set it equal to this particular value.
Similarly, if at any intermediate stage the result is undefined - say, u(8) is undefined - the program should extend the corresponding branch with v=u(8), w=v(8), s=w(3), r=s(1) and r(0)=788569 (again computing the latter value with the algorithm supplied by me).
Unfortunately I cannot estimate how many such sequences up to a given length will occur in my computations; definitely not all of them, but I do not know the proportion. Eventually I will probably need to erase some of the data to prevent memory overflow; to keep it simple, let us assume I will just chop off the least-visited branches.
So, is there any ready-made structure in C++ for this kind of storage?
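No answer is recorded here, but the lookup described above is exactly a trie walk with compute-on-miss memoization. A rough sketch of that idea, written in OCaml to match the main question of this thread (a C++ version would use e.g. a std::map or std::unordered_map of children per node):
(* Trie node: children keyed by the next natural number in the sequence,
   plus an optional memoized value stored at this node. *)
type node = {
  children : (int, node) Hashtbl.t;
  mutable value : int option;
}

let make_node () = { children = Hashtbl.create 4; value = None }

(* Walk the sequence, creating missing branches; if no value is stored at the
   end, compute it with [compute] and memoize it (the compute-on-miss step). *)
let lookup root compute seq =
  let node =
    List.fold_left
      (fun n k ->
        match Hashtbl.find_opt n.children k with
        | Some child -> child
        | None ->
          let child = make_node () in
          Hashtbl.add n.children k child;
          child)
      root seq
  in
  match node.value with
  | Some v -> v
  | None ->
    let v = compute seq in
    node.value <- Some v;
    v
For the example entry, lookup root compute [5; 11; 6; 2; 4; 8; 8; 3; 1] either returns the stored 788569 or computes and memoizes it.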

What are some efficient ways to de-dupe a set of > 1 million strings?

For my project, I need to de-dupe very large sets of strings very efficiently. I.e., given a list of strings that may contain duplicates, I want to produce a list of all the strings in that list, but without any duplicates.
Here's the simplified pseudocode:
set = {}    # empty set
deduped = []
for string in strings:
    if !set.contains(string):
        set.add(string)
        deduped.add(string)
Here's the simplified C++ for it (roughly):
// Note: with const char * keys the default hash and equality compare pointer
// values, not string contents, so this assumes the strings are interned.
std::unordered_set<const char *> set;
for (auto &string : strings) {
    // do some non-trivial work here that is difficult to parallelize
    auto result = set.insert(string); // unordered_set has insert/emplace, not try_emplace
}
// afterwards, iterate over the set and dump the strings into a vector
However, this is not fast enough for my needs (I've benchmarked it carefully). Here are some ideas to make it faster:
Use a different C++ set implementation (e.g., abseil's)
Insert into the set concurrently (however, per the comment in the C++ implementation, this is hard. Also, there will be performance overhead to parallelizing)
Because the set of strings changes very little across runs, perhaps cache whether the hash function produces any collisions on it. If it produces none (after accounting for the changes), then strings can be compared by their hashes during lookup rather than by actual string equality (strcmp).
Store the de-duped strings in a file across runs (that may seem simple, but there are lots of complexities here)
All of these solutions, I've found, are either prohibitively tricky or don't provide that big of a speedup. Any ideas for fast de-duping? Ideally, something that doesn't require parallelization or file caching.
You can try various algorithms and data structures to solve your problem:
Try a prefix tree (trie), a suffix automaton, or a hash table. A hash table is one of the fastest ways to find duplicates; try different hash table implementations.
Use attributes of the data to avoid unnecessary work. For example, partition the strings by length and only look for duplicates within each partition.
Try a "divide and conquer" approach to parallelize the computation. For example, divide the set of strings into a number of subsets equal to the number of hardware threads, de-dupe each subset, then combine the subsets into one. Since the subsets shrink in the process (if the number of duplicates is large enough), combining them should not be too expensive.
Unfortunately, there is no general approach to this problem; to a large extent, the decision depends on the nature of the data being processed. The second item on my list seems the most promising to me. Always try to reduce the computation to a smaller data set.
You can significantly parallelize your task by implementing a simplified version of std::unordered_set manually:
Create an arbitrary number of buckets (probably proportional or equal to the number of threads in your thread pool).
Using the thread pool, calculate the hashes of your strings in parallel and distribute the strings, with their hashes, between the buckets. You may need to lock individual buckets when adding strings, but the operation is short, and/or you may use a lock-free structure.
Process each bucket individually using your thread pool: compare hashes and, when they are equal, compare the strings themselves.
You may need to experiment with the bucket size and check how it affects performance. Logically it should be neither too big nor too small, to prevent contention.
By the way, from your description it sounds as if you load all strings into memory and then eliminate duplicates. You could instead read your data directly into the std::unordered_set; then you will save memory and increase speed as well.
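A compact sketch of that bucket-splitting idea (in OCaml, the language of the main question; the bucket count and hash function are arbitrary choices here): each bucket is independent, so the per-bucket de-duplication can be farmed out to a thread pool without locking:
(* Split strings into hash buckets, then de-dupe each bucket on its own. *)
let dedupe ~n_buckets strings =
  let buckets = Array.make n_buckets [] in
  List.iter
    (fun s ->
      let b = Hashtbl.hash s mod n_buckets in
      buckets.(b) <- s :: buckets.(b))
    strings;
  (* Per-bucket de-dupe: a small hash set of the strings seen so far.
     Buckets share nothing, so each call could run on its own worker. *)
  let dedupe_bucket bucket =
    let seen = Hashtbl.create 64 in
    List.fold_left
      (fun acc s ->
        if Hashtbl.mem seen s then acc
        else (Hashtbl.add seen s (); s :: acc))
      [] bucket
  in
  Array.to_list buckets |> List.concat_map dedupe_bucket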

Finding k smallest/largest elements in array with focus on memory

I have an unsigned int array of n elements (with n at most around 20-25). Duplicates are possible.
I know that the smallest k values are of type A and the other (larger) n-k values are of type B. In order to differentiate between A and B, I need to find the indices of the k smallest values (or of the n-k largest values, whichever is easier/faster). The original array must not be altered, as an element's index carries information.
There are multiple solutions for this problem on the web (e.g. here). However, most of them try to optimize processing time and neglect memory usage.
As I am implementing the code in C++ on an (Arduino-based) microcontroller, I have to focus on low memory usage and, if necessary, accept a slightly longer processing time. I therefore feel unsafe using pointers and recursion (maybe I wouldn't if I knew more about them, but in fact I don't).
Can you recommend which algorithm would be best for that task (an implementation is welcome but not essential)?
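One algorithm that fits these constraints is plain repeated selection: no recursion, no pointers, no allocation beyond two small fixed arrays, and O(k*n) time, which is negligible for n around 20-25. A sketch (in OCaml, the language of the main question; it translates line by line to C++ with plain arrays):
(* Indices of the k smallest values of [a], without modifying [a].
   Assumes k <= n.  Duplicates are handled: ties go to the lowest index. *)
let k_smallest_indices a k =
  let n = Array.length a in
  let taken = Array.make n false in
  let result = Array.make k 0 in
  for i = 0 to k - 1 do
    (* Linear scan for the smallest element not yet selected. *)
    let best = ref (-1) in
    for j = 0 to n - 1 do
      if (not taken.(j)) && (!best < 0 || a.(j) < a.(!best)) then best := j
    done;
    taken.(!best) <- true;
    result.(i) <- !best
  done;
  result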

Efficient data structure to map integer-to-integer with find & insert, no allocations and fixed upper bound

I am looking for input on an associative data structure that might take advantage of the specific criteria of my use case.
Currently I am using a red/black tree to implement a dictionary that maps keys to values (in my case integers to addresses).
In my use case, the maximum number of elements is known up front (1024), and I will only ever be inserting and searching. Searching happens twenty times more often than inserting. At the end of the process I clear the structure and repeat again. There can be no allocations during use - only the initial up front one. Unfortunately, the STL and recent versions of C++ are not available.
Any insight?
I ended up implementing a simple linear-probe HashTable from an example here. I used the MurmurHash3 hash function since my data is randomized.
I modified the hash table in the following ways:
The size is a template parameter. Internally, the capacity is doubled: the implementation requires power-of-two sizes and traditionally resizes at 75% occupancy, and since I know I am going to fill the table up and cannot resize it, I pre-emptively start it at double capacity to keep it sparse enough. This might be less efficient when adding a small number of objects, but it is more efficient once the table starts to fill up.
I do not allow keys with a value of zero to be stored. This is okay for my application and it keeps the code simple.
All resizing and deleting is removed, replaced by a single clear operation which performs a memset.
I chose to inline the insert and lookup functions since they are quite small.
It is faster than my previous red/black tree implementation. The only change I might make is to revisit the hashing scheme to see if there is something in the source keys that would allow a cheaper hash.
Billy ONeal suggested that, given a small number of elements (1024), a simple linear search over a fixed array would be faster. I followed his advice and implemented one for a side-by-side comparison. On my target hardware (roughly a first-generation iPhone) the hash table outperformed the linear search by a factor of two to one, and at smaller sizes (256 elements) it was still superior. Of course these figures are hardware-dependent; cache line sizes and memory access speed are terrible in my environment. Still, others looking at this problem would be smart to follow his advice: try it and profile first.
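For readers who want to see the shape of such a table, here is a rough sketch of a fixed-capacity linear-probe map (in OCaml, matching the main question; the power-of-two capacity, the "zero key means empty" convention and the memset-style clear mirror the description above, while Hashtbl.hash merely stands in for MurmurHash3):
(* Fixed-capacity open-addressing table with linear probing.
   Assumes keys are never 0 and the table never becomes completely full. *)
type t = { keys : int array; values : int array; mask : int }

let create capacity_pow2 =
  let cap = 1 lsl capacity_pow2 in
  { keys = Array.make cap 0; values = Array.make cap 0; mask = cap - 1 }

(* Single bulk clear, analogous to the memset-based clear described above. *)
let clear t = Array.fill t.keys 0 (Array.length t.keys) 0

(* Walk forward until we hit an empty slot or the key itself. *)
let rec probe t key i =
  if t.keys.(i) = 0 || t.keys.(i) = key then i
  else probe t key ((i + 1) land t.mask)

let insert t key value =
  let i = probe t key (Hashtbl.hash key land t.mask) in
  t.keys.(i) <- key;
  t.values.(i) <- value

let find t key =
  let i = probe t key (Hashtbl.hash key land t.mask) in
  if t.keys.(i) = key then Some t.values.(i) else None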

Efficiently insert integers into a growing array if no duplicates

There is a data structure which acts like a growing array. An unknown number of integers will be inserted into it one by one, and an integer is inserted if and only if it has no duplicate already in the structure.
Initially I thought a std::set would suffice: it grows automatically as new integers come in and guarantees there are no duplicates.
But as the set grows large, the insertion speed goes down. So, is there any other idea for doing this job, besides hashing?
PS: I wonder whether tricks such as XOR-ing all the elements, or building a sparse table (as for RMQ), would apply here?
If you're willing to spend memory on the problem, 2^32 bits is 512MB, at which point you can just use a bit field, one bit per possible integer. Setting aside CPU cache effects, this gives O(1) insertion and lookup times.
Without knowing more about your use case, it's difficult to say whether this is a worthwhile use of memory or a vast memory expense for almost no gain.
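A minimal sketch of that bit field (in OCaml, matching the main question; it assumes a 64-bit runtime and uses Bytes as the backing store, which is my choice):
(* One bit per possible 32-bit value: 2^32 bits = 2^29 bytes = 512 MB. *)
let create () = Bytes.make (1 lsl 29) '\000'

(* Insert x (taken as an unsigned 32-bit value); returns true if x is new,
   false if it was already present.  Both paths are O(1). *)
let add bits x =
  let ux = x land 0xFFFFFFFF in
  let byte = Char.code (Bytes.get bits (ux lsr 3)) in
  let mask = 1 lsl (ux land 7) in
  if byte land mask <> 0 then false
  else begin
    Bytes.set bits (ux lsr 3) (Char.chr (byte lor mask));
    true
  end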
This site lists all the standard containers and their running times for each operation, so it may be useful: http://en.cppreference.com/w/cpp/container
As suggested, unordered_set seems to be your best option.
You could try a std::unordered_set, which should be implemented as a hash table (well, I do not understand why you write "besides hash"; std::set is normally implemented as a balanced tree, which is probably the reason for the insufficient insertion performance).
If there is some range the numbers fall in, then you can create several std::sets as buckets.
EDIT: Given the range you have specified, std::set should be fast enough. O(log n) is fast enough for most purposes, unless you have done some measurements and found it slow for your case.
You can also use the pigeonhole principle along with sets to reject possible duplicates (applicable when the set grows large).
A bit vector can be useful to detect duplicates
Even more requirements would be necessary for an optimal decision. This suggestion is based on the following constraints:
Alcott: 32-bit integers, with about 10,000,000 elements (i.e. any 10M out of 2^32).
It is a BST (binary search tree) where every node stores two values, the beginning and the end of a contiguous region: the first is the number where the region starts, the second the number where it ends. This arrangement allows large regions, in the hope that you reach your 10M limit with a very small tree height, hence cheap searches. Each node takes 8 bytes for the two values plus 2x4 bytes for links to at most two children, i.e. 16 bytes per node, so even in the worst case of 10M single-element regions that is about 160 MB, and much less when the regions are long. And of course, if more elements are usually present than absent, you can instead keep track of the ones which are not.
Now, if you need to be very careful with space and, after running simulations and/or statistical checks, you find that there are lots of small regions (fewer than 32 elements long), you may want to change your node type to one number that starts the region plus a bitmap.
If you don't have to align access to the bitmap and, say, you only have contiguous chunks of 8 elements, then your memory requirement becomes 4+1 bytes for the node and 4+4 bytes for the children. Hope this helps.
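A bare-bones sketch of this interval-node idea (in OCaml, matching the main question; only membership and a naive insert are shown, without the range merging or rebalancing a real implementation would need):
(* Each node covers the closed range [lo, hi]; subtrees hold ranges entirely
   below / above it.  Membership is a plain BST descent. *)
type tree =
  | Leaf
  | Node of tree * int * int * tree   (* left, lo, hi, right *)

let rec mem x = function
  | Leaf -> false
  | Node (l, lo, hi, r) ->
    if x < lo then mem x l
    else if x > hi then mem x r
    else true

(* Naive insert: extends an adjacent range or adds a one-element range.
   A full implementation would also merge ranges that become adjacent
   and keep the tree balanced. *)
let rec add x = function
  | Leaf -> Node (Leaf, x, x, Leaf)
  | Node (l, lo, hi, r) as t ->
    if x >= lo && x <= hi then t
    else if x = lo - 1 then Node (l, x, hi, r)
    else if x = hi + 1 then Node (l, lo, x, r)
    else if x < lo then Node (add x l, lo, hi, r)
    else Node (l, lo, hi, add x r)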