What's an efficient way to store a chessboard in LISP? - clojure

What's an efficient way to store a chessboard in LISP for example to solve the 8-queens puzzle?

For the 8 queens problem, your most compact storage is going to be an array of 8 bytes. Clojure provides the byte-array function to simplify creating just such an array. Treat each byte as an array of 8 bits, and use 0 for an empty square and 1 for a queen.
This will not work if you intend to use more than one type of chess piece; additionally, you should consider a different approach if you want variable board sizes.

To solve the eight queens problem efficiently, you are looking for an efficient way to represent a partial solution.
If we number ranks and files 0 to 7, and work progressively through the ranks, then a vector rank -> file does the job.
The ranks take care of themselves.
The free files are what remains of (set (range 8)).
The free rising diagonals are what remains of (set (range 15)).
The free falling diagonals are what remains of (set (range -7 8)).
... where, for every square [i j],
the rank is i
the file is j
the rising-diagonal is (+ i j)
the falling-diagonal is (- i j)
You could, as @WolfeFan suggests, use bit-sets to store the free or occupied slots. But the storage involved is negligible in either representation, and which representation is faster I wouldn't care to guess.
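For concreteness, here is a minimal backtracking sketch of that representation, written in C++ with std::bitset standing in for the three sets of free files and diagonals (the same shape carries over directly to Clojure sets). The falling diagonal is shifted by +7 so all indices are non-negative:

#include <bitset>
#include <cstdio>

// Count the 8-queens solutions, placing one queen per rank and tracking
// occupied files and diagonals. For a square [i j]: file = j,
// rising diagonal = i + j, falling diagonal = i - j + 7 (shifted).
static int solve(int rank, std::bitset<8>& files,
                 std::bitset<15>& rising, std::bitset<15>& falling)
{
    if (rank == 8) return 1;                 // all eight ranks placed
    int count = 0;
    for (int file = 0; file < 8; ++file) {
        int r = rank + file, f = rank - file + 7;
        if (files[file] || rising[r] || falling[f]) continue;  // slot taken
        files[file] = rising[r] = falling[f] = true;           // occupy
        count += solve(rank + 1, files, rising, falling);
        files[file] = rising[r] = falling[f] = false;          // release
    }
    return count;
}

int main()
{
    std::bitset<8> files;
    std::bitset<15> rising, falling;
    std::printf("%d\n", solve(0, files, rising, falling));     // prints 92
}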

Related

C++ iterate vector randomly

I'm working on a multithreaded program where all threads share some vector (read-only). The goal of each thread is to walk the entire vector, but each thread must visit it in a different order.
Since the vector is const and shared among all threads, I cannot use random_shuffle and just iterate over it. For now my solution is to build a crossref vector that contains indices into the shared vector, and then shuffle this vector, i.e.
std::vector<int> crossref(SIZE); // SIZE is the size of the shared vector
std::iota(std::begin(crossref), std::end(crossref), 0); // fill with indices
std::mt19937 g(SEED); // each thread has its own seed
std::shuffle(crossref.begin(), crossref.end(), g); // shuffle the indices
Nonetheless, doing this reveals some problems: (1) it is not very efficient, since every thread needs to access its crossref vector before accessing the shared one; (2) I have some performance issues because of the amount of memory required: the shared vector is very big and I have a lot of threads and processors.
Does anyone have ideas for an improvement that avoids the need for extra memory?
You can use the algebraic notion of a primitive root modulo n.
Basically,
If n is a positive integer, the integers between 1 and n - 1 that are coprime to n form the multiplicative group of integers modulo n. This group is cyclic if and only if n is equal to 2, 4, p^k, or 2p^k, where p^k is a power of an odd prime number.
Wikipedia shows how the numbers below 7 can be generated using 3 as a generator.
From this you can derive an algorithm:
Take your number n.
Find the next prime number m which is bigger than n.
For each of your threads, pick a unique random number F(0) between 2 and m - 1.
Compute the next value using F(i+1) = (F(i) * F(0)) mod m. If that value maps to a valid index in [0, n), access the element; if not, move on to the next value.
Stop after m - 1 iterations (or when you obtain 1 again, which is the same thing).
Because m is prime, every number between 2 and m - 1 is coprime to m. When F(0) is a primitive root modulo m (a generator of the multiplicative group {1, ..., m - 1}), you are guaranteed that no value will repeat in the first m - 1 steps and that all m - 1 values will appear. Note that not every F(0) is a generator (the order of F(0) only has to divide m - 1), so in practice you test candidates or precompute a set of primitive roots.
Complexity:
Step 2: done once; the cost is equivalent to finding the primes up to n, i.e. a sieve of Eratosthenes.
Step 3: done once per thread, which is as low as O(thread count).
Step 4: O(m) time and O(1) space per thread. You don't need to store the F(i); you only need to know the first and the current value. This is the same property as plain incrementing.
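A minimal sketch of steps 3 and 4 (assuming m is prime with m > n, m fits in 32 bits so f * g cannot overflow 64-bit arithmetic, and the chosen F(0) is a primitive root modulo m; the names are illustrative). The group values run over [1, m-1], so they are shifted down by one to serve as indices:

#include <cstdint>
#include <iostream>

// Visit the indices [0, n) in a scrambled order using O(1) extra memory.
// Requires: m prime with m > n, g a primitive root modulo m, and m < 2^32
// so that f * g cannot overflow 64 bits.
template <typename Visit>
void visit_scrambled(std::uint64_t n, std::uint64_t m, std::uint64_t g,
                     Visit visit)
{
    std::uint64_t f = g;
    for (std::uint64_t step = 1; step < m; ++step) { // m - 1 iterations
        if (f - 1 < n)     // f runs over [1, m-1]; shift to index f - 1
            visit(f - 1);  // skip values falling outside [0, n)
        f = (f * g) % m;
    }
}

int main()
{
    // n = 10 elements; m = 11 is prime and g = 2 is a primitive root mod 11.
    visit_scrambled(10, 11, 2,
                    [](std::uint64_t i) { std::cout << i << ' '; });
}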
If I understand correctly, you want to generate a random permutation in an incremental way, i.e. you want to call a function f n times so that it generates all permuted numbers from 1 to n, with the function using constant memory.
I doubt such a function exists if you want a uniform distribution among the permutations, but you may be satisfied with a subset of the set of permutations.
In that case you can generate a permutation by taking a number p coprime to n and calculating, for each i in [1, n]: i*p (mod n).
For example, if you have n=5 and p=7, then 7%5=2, 14%5=4, 21%5=1, 28%5=3, 35%5=0. You may combine several such functions to obtain something satisfactory...
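A minimal sketch of that idea (it assumes gcd(p, n) = 1; the function name is illustrative):

#include <cstddef>
#include <iostream>

// Emit the permutation i -> (i * p) % n for i = 1..n; this visits every
// index in [0, n) exactly once as long as p is coprime to n.
void coprime_permutation(std::size_t n, std::size_t p)
{
    for (std::size_t i = 1; i <= n; ++i)
        std::cout << (i * p) % n << ' ';
}

int main() { coprime_permutation(5, 7); } // prints: 2 4 1 3 0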
If memory is your biggest problem then you'll have to trade CPU cycles for memory space.
E.g. C++'s std::vector<bool> (http://en.cppreference.com/w/cpp/container/vector_bool) is a bit-array, so quite memory efficient.
Each thread could have its own vector<bool> indicating whether or not it has visited a particular index. Then you'd spend CPU cycles randomly choosing an index it hasn't visited yet, terminating when all bools are true.
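A minimal sketch of that scheme (one bit per index; the expected number of random draws is O(n log n), since the last few unvisited indices take many retries; names are illustrative):

#include <cstddef>
#include <random>
#include <vector>

// Visit all n indices in random order using one bit of memory per index.
template <typename Visit>
void visit_with_bitmap(std::size_t n, unsigned seed, Visit visit)
{
    std::vector<bool> visited(n, false); // packed: one bit per element
    std::mt19937 gen(seed);
    std::uniform_int_distribution<std::size_t> dist(0, n - 1);
    for (std::size_t done = 0; done < n; ) {
        std::size_t i = dist(gen);
        if (!visited[i]) { visited[i] = true; visit(i); ++done; }
    }
}

int main()
{
    visit_with_bitmap(10, 42, [](std::size_t i) { /* touch element i */ });
}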
It seems this guy solved your problem in a very nice way.
This is what he says in the first line of the post: In this post I’m going to show a way to make an iterator that will visit items in a list in a random order, only visit each item once, and tell you when it’s visited all items and is finished. It does this without storing a shuffled list, and it also doesn’t have to keep track of which items it has already visited.
He leverages the power of a variable bit-length block cipher algorithm to generate each and every index in the array.
This is not a complete answer but it should lead us to a correct solution.
You have written some things which we could take as assumptions:
(1) it is not very efficient since every thread needs to access its
crossref vector before accessing the shared one,
This is unlikely to be true. We're talking about one indirect lookup. Unless your reference data is really a vector of ints, this will represent an infinitesimal part of your execution time. If your reference data is a vector of ints, then just make N copies of it and shuffle them...
(2) i have some performances issue because of the amount of memory
required : the shared vector is very big and i have a lot of thread
and processors.
How big? Did you measure it? How many discrete objects are there in the vector? How big is each one?
How many threads?
How many processors?
How much memory do you have?
Have you profiled the code? Are you sure where the performance bottleneck is? Have you considered a more elegant algorithm?

What container really mimics std::vector in Haskell?

The problem
I'm looking for a container that is used to save partial results of n - 1 problems in order to calculate the nth one. This means that the size of the container, at the end, will always be n.
Each element, i, of the container depends on at least 2 and up to 4 previous results.
The container has to provide:
constant time insertions at either beginning or end (one of the two, not necessarily both)
constant time indexing in the middle
or alternatively (given a O(n) initialization):
constant time single element edits
constant time indexing in the middle
What is std::vector and why is it relevant
For those of you who don't know C++, std::vector is a dynamically sized array. It is a perfect fit for this problem because it is able to:
reserve space at construction
offer constant time indexing in the middle
offer constant time insertion at the end (with a reserved space)
Therefore this problem is solvable in O(n) complexity, in C++.
Why Data.Vector is not std::vector
Data.Vector, together with Data.Array, provides functionality similar to std::vector, but not quite the same. Both, of course, offer constant time indexing in the middle, but they offer neither constant time modification ((//) for example is at least O(n)) nor constant time insertion at either beginning or end.
Conclusion
What container really mimics std::vector in Haskell? Alternatively, what is my best shot?
From reddit comes the suggestion to use Data.Vector.constructN:
O(n) Construct a vector with n elements by repeatedly applying the generator function to the already constructed part of the vector.
constructN 3 f = let a = f <> ; b = f <a> ; c = f <a,b> in f <a,b,c>
For example:
λ import qualified Data.Vector as V
λ V.constructN 10 V.length
fromList [0,1,2,3,4,5,6,7,8,9]
λ V.constructN 10 $ (1+) . V.sum
fromList [1,2,4,8,16,32,64,128,256,512]
λ V.constructN 10 $ \v -> let n = V.length v in if n <= 1 then 1 else (v V.! (n - 1)) + (v V.! (n - 2))
fromList [1,1,2,3,5,8,13,21,34,55]
This certainly seems to qualify to solve the problem as you've described it above.
The first data structures that come to my mind are either Maps from Data.Map or Sequences from Data.Sequence.
Update
Data.Sequence
Sequences are persistent data structures that support most operations efficiently, while allowing only finite sequences. Their implementation is based on finger trees, if you are interested. But what qualities does it have?
O(1) calculation of the length
O(1) insert at front/back with the operators <| and |> respectively.
O(n) creation from a list with fromList
O(log(min(n1,n2))) concatenation for sequences of length n1 and n2.
O(log(min(i,n-i))) indexing for an element at position i in a sequence of length n.
Furthermore this structure supports a lot of the known and handy functions you'd expect from a list-like structure: replicate, zip, null, scans, sort, take, drop, splitAt and many more. Due to these name clashes, you have to either import it qualified or hide the functions from Prelude that have the same name.
Data.Map
Maps are the standard workhorse for realizing a correspondence between "things"; what you might call a hashmap or associative array in other programming languages is called a Map in Haskell. Unlike in, say, Python, Maps are pure - an update gives you back a new Map and does not modify the original instance.
Maps come in two flavors - strict and lazy.
Quoting from the Documentation
Strict
API of this module is strict in both the keys and the values.
Lazy
API of this module is strict in the keys, but lazy in the values.
So you need to choose what fits best for your application. You can try both versions and benchmark with criterion.
Instead of listing the features of Data.Map I want to pass on to
Data.IntMap.Strict
which can leverage the fact that the keys are integers to squeeze out better performance
Quoting from the documentation we first note:
Many operations have a worst-case complexity of O(min(n,W)). This means that the operation can become linear in the number of elements with a maximum of W -- the number of bits in an Int (32 or 64).
So what are the characteristics of IntMaps?
O(min(n,W)) for (unsafe) indexing (!), unsafe in the sense that you will get an error if the key/index does not exist. This is the same behavior as Data.Sequence.
O(n) calculation of size
O(min(n,W)) for safe indexing lookup, which returns a Nothing if the key is not found and Just a otherwise.
O(min(n,W)) for insert, delete, adjust and update
So you see that this structure is less efficient than Sequences, but provides a bit more safety, and a big benefit if you don't actually need all entries, such as when representing a sparse graph whose nodes are integers.
For completeness I'd like to mention a package called persistent-vector, which implements Clojure-style vectors, but it seems to be abandoned, as the last upload is from 2012.
Conclusion
So for your use case I'd strongly recommend Data.Sequence or Data.Vector; unfortunately I don't have any experience with the latter, so you'll need to try it for yourself. From what I know, it provides a powerful optimization called stream fusion, which executes multiple functions in one tight "loop" instead of running a loop for each function. A tutorial for Vector can be found here.
When looking for functional containers with particular asymptotic run times, I always pull out Edison.
Note that there's a result that in a strict language with immutable data structures, there's always a logarithmic slowdown when implementing a mutable data structure on top of them. It's an open problem whether the limited mutation hidden behind laziness can avoid that slowdown. There's also the issue of persistent vs. transient...
Okasaki is still a good read for background, but finger trees or something more complex like an RRB-tree should be available "off-the-shelf" and solve your problem.
I'm looking for a container that is used to save partial results of n - 1 problems in order to calculate the nth one.
Each element, i, of the container depends on at least 2 and up to 4 previous results.
Let's consider a very small program that calculates Fibonacci numbers.
fib 1 = 1
fib 2 = 1
fib n = fib (n-1) + fib (n-2)
This is great for small n, but horrible for n > 10, since the naive recursion is exponential. At this point, you stumble across this gem:
fib n = fibs !! n where fibs = 1 : 1 : zipWith (+) fibs (tail fibs)
You may be tempted to exclaim that this is dark magic (infinite, self-referential list building and zipping? wth!) but it is really a great example of tying the knot, and of using laziness to ensure that values are calculated as needed.
Similarly, we can use an array to tie the knot too.
import Data.Array
fib n = arr ! n
  where arr :: Array Int Int
        arr = listArray (1, n) (map fib' [1 .. n])
        fib' 1 = 1
        fib' 2 = 1
        fib' i = arr ! (i - 1) + arr ! (i - 2)
Each element of the array is a thunk that uses other elements of the array to calculate its value. In this way, we can build a single array, never having to perform concatenation, and pull values out of the array at will, paying only for the calculation up to that point.
The beauty of this method is that you are not restricted to looking behind you; you can look in front of you as well.

Efficiently "Hash" Set of Values to Indice

First of all, a disclaimer: hash is a somewhat inaccurate term for what I'm aiming for, so please feel free to suggest a better title.
At any rate, I'm currently attempting to program a complex spatial algorithm running in real-time. In order to save cycles, I've decided to generate a lookup table that contains all of the 32,000 possibilities.
If I were to do this conventionally, the values (two fields with the inclusive range 0 to +15 and three fields with the inclusive range -2 to +2) would be mapped to two four-bit and three three-bit values respectively, giving a lookup-table size of 2 ^ (2*4 + 3*3) = 131,072 entries, roughly 4.1 times the 32,000 actually needed.
Given the nature of the algorithm, collisions would absolutely cripple its functionality (so no traditional hash functions unless I could guarantee no collisions with all relevant values). Beyond that, the structure I'm working with is rather large (ie, I would /really/ like to avoid allocating any more than 200% of what I need). Finally, since this table will be referenced so often, I'd like to avoid the overhead of a traditional hash-table in both bucket lookups and an excessively complex hash function.
Having taken a more traditional computer-science approach, I'm beginning to strongly believe the solution lies in some mathematics of base-conversion I'm completely ignorant of. Any idea if this is the case?
You can calculate an index the same way you calculated the maximum number of combinations, by multiplying out the field ranges. Take each field from most significant to least significant, add a constant so it ranges from 0 to n-1, and multiply it by the number of combinations of the remaining fields.
Given your 0 to 15 values of a, b (range of 16) and -2 to +2 values of c, d, e (range of 5):
index = a * 16*5*5*5 + b * 5*5*5 + (c+2) * 5*5 + (d+2) * 5 + (e+2);
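For illustration, a small sketch of this mixed-radix packing together with its inverse (the digit bases 16,16,5,5,5 and field names follow the formula above; pack/unpack are hypothetical names):

#include <cassert>

// Pack (a, b) in [0, 15] and (c, d, e) in [-2, +2] into a dense index
// in [0, 32000): a mixed-radix number with digit bases 16,16,5,5,5.
int pack(int a, int b, int c, int d, int e)
{
    return a * (16*5*5*5) + b * (5*5*5) + (c+2) * (5*5) + (d+2) * 5 + (e+2);
}

// Inverse: recover the fields from the index by successive div/mod.
void unpack(int index, int& a, int& b, int& c, int& d, int& e)
{
    e = index % 5 - 2; index /= 5;
    d = index % 5 - 2; index /= 5;
    c = index % 5 - 2; index /= 5;
    b = index % 16;    index /= 16;
    a = index;
}

int main()
{
    int a, b, c, d, e;
    unpack(pack(7, 3, -2, 0, 2), a, b, c, d, e);
    assert(a == 7 && b == 3 && c == -2 && d == 0 && e == 2);
}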

Data structure with fast contiguous ranges retrieval

Imagine a data structure that manages some contiguous container and allows quick retrieval of the contiguous ranges of indices within this array that contain data (and probably of the free ranges too). Let's call these ranges "blocks". Each block knows its head and tail index:
struct Block
{
    size_t begin;
    size_t end;
};
When we manipulate the array, our data structure updates the blocks:
array view blocks [begin, end]
--------------------------------------------------------------
0 1 2 3 4 5 6 7 8 9 [0, 9]
pop 2 block 1 split
0 1 _ 3 4 5 6 7 8 9 [0, 1] [3, 9]
pop 7, 8 block 2 split
0 1 _ 3 4 5 6 _ _ 9 [0, 1] [3, 6] [9, 9]
push 7 changed end of block 3
0 1 _ 3 4 5 6 7 _ 9 [0, 1] [3, 7] [9, 9]
push 5 error: already in
0 1 _ 3 4 5 6 7 _ 9 [0, 1] [3, 7] [9, 9]
push 2 blocks 1, 2 merged
0 1 2 3 4 5 6 7 _ 9 [0, 7] [9, 9]
Even before profiling, we know that block retrieval speed will be the cornerstone of application performance.
Basically usage is:
very often retrieval of contiguous blocks
quite rare insertions/deletions
most of the time we want the number of blocks to be minimal (to prevent fragmentation)
What we have already tried:
std::vector<bool> + std::list<Block*>. On every change: write true/false to the vector, then traverse it in a for loop and regenerate the list. On every query of blocks, return the list. Slower than we wanted.
std::list<Block*>, updating the list directly, so no traversing. Return the list. Much code to debug/test.
Questions:
Does this data structure have some generic name?
Are there existing implementations of such a data structure (debugged and tested)?
If not, what can you advise for a fast and robust implementation of such a data structure?
Sorry if my explanation is not quite clear.
Edit
Typical applications for this container are managing buffers, either in system or GPU memory. In the GPU case we can store huge amounts of data in a single vertex buffer, and then update/invalidate some regions. On each draw call we must know the first and last index of each valid block in the buffer (very often, tens to hundreds of times per second), and sometimes (once a second) we must insert/remove blocks of data.
Another application is a custom "block memory allocator". For that purpose, a similar data structure is implemented in "Alexandrescu A. - Modern C++ Design" via an intrusive linked list. I'm looking for better options.
What I see here is a simple binary tree.
You have pairs (blocks) with begin and end indices, that is, pairs (a, b) where a <= b. So the set of blocks can easily be ordered and stored in a binary search tree.
Searching for the block which corresponds to a given number is easy (just a typical binary tree search). When you delete a number from the array, you need to find the block that contains the number and split it into two new blocks. Note that all blocks are leaves; the internal nodes are the intervals formed by their two child nodes.
Insertion, on the other hand, means finding the block and testing its siblings to know whether they should be collapsed. This should be done recursively up through the tree.
You may want to try a tree-like structure, either a simple red-black tree or a B+ tree.
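For a concrete starting point, here is a minimal sketch along these lines: std::map (typically a red-black tree) keyed on each block's begin index, with pop splitting the enclosing block and push merging adjacent neighbours. It assumes popped indices currently lie inside a block, and it is a sketch, not production code:

#include <cstddef>
#include <iterator>
#include <map>

struct BlockSet {
    // begin -> end (inclusive), ordered by begin index.
    std::map<std::size_t, std::size_t> blocks;

    // Remove index i, splitting its enclosing block if necessary.
    // Assumes i currently lies inside some block.
    void pop(std::size_t i) {
        auto it = std::prev(blocks.upper_bound(i)); // block with begin <= i
        std::size_t b = it->first, e = it->second;
        blocks.erase(it);
        if (b < i) blocks[b] = i - 1;               // left remainder
        if (i < e) blocks[i + 1] = e;               // right remainder
    }

    // Re-insert index i, merging with adjacent blocks where possible.
    void push(std::size_t i) {
        std::size_t b = i, e = i;
        auto right = blocks.find(i + 1);            // block starting at i+1?
        if (right != blocks.end()) { e = right->second; blocks.erase(right); }
        auto left = blocks.lower_bound(i);          // first block at or after i
        if (left != blocks.begin() && (--left)->second + 1 == i) {
            b = left->first;                        // block ending at i-1
            blocks.erase(left);
        }
        blocks[b] = e;
    }
};

Retrieving all blocks is then a linear walk over the map, and both pop and push are O(log n) in the number of blocks.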
Your first solution (vector of bools + list of blocks) seems like a good direction, but note that you don't need to regenerate the list completely from scratch (or go over the entire vector): you just need to traverse the list until you find where the newly changed index belongs, and split/merge the appropriate blocks on the list.
If the list traversal proves too long, you could instead implement a vector of blocks, where each block is mapped to its start index, and each hole has a block saying where the hole ends. You can traverse this vector as fast as a list, since you always jump to the next block (one O(1) lookup to determine the end of the block, another O(1) lookup to determine the beginning of the next block). The benefit, however, is that you can also access indices directly (for push/pop) and figure out their enclosing block with a binary search.
To make it work, you'll have to do some maintenance on the "holes" (merge and split them like real blocks), but that should also be O(1) on any insertion/deletion. The important invariant is that there's always a single hole between consecutive blocks, and vice versa.
Why are you using a list of blocks? Do you need stable iterators AND stable references? boost::stable_vector may help. If you don't need stable references, maybe what you want is to write a wrapper container that contains a std::vector of blocks and a secondary memory map of size blocks.capacity(), which maps from an iterator index (kept inside the returned iterators) to the real offset in the blocks vector, plus a list of currently unused iterator indices.
Whenever you erase members from blocks, you repack blocks and shuffle the map accordingly for increased cache coherence, and when you want to insert, just push_back to blocks.
With block packing, you get cache coherence when iterating at the cost of deletion speed. And maintain relatively fast insert times.
Alternatively, if you need stable references and iterators, or if the size of the container is very large, at the cost of some access speed, iteration speed, and cache coherency, you wrap each entry in the vector in a simple structure that contains the real entry and an offset to the next valid, or just store pointers in the vector and have them at null on deletion.

Efficient partial reductions given arrays of elements, offsets to and lengths of sublists

For my application I have to handle a bunch of objects (let's say ints) that get subsequently divided and sorted into smaller buckets. To this end, I store the elements in a single contiguous array
arr = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14...}
and the information about the buckets (sublists) is given by offsets to the first element in the respective bucket and the lengths of the sublists.
So, for instance, given
offsets = {0,3,8,..}
sublist_lengths = {3,5,2,...}
would result in the following splits:
0 1 2 || 3 4 5 6 7 || 8 9 || ...
What I am looking for is a somewhat general and efficient way to run algorithms, like reductions, on the buckets only using either custom kernels or the thrust library. Summing the buckets should give:
3 || 25 || 17 || ...
What I've come up with:
option 1: custom kernels require quite a bit of tinkering: copies into shared memory, proper choice of block and grid sizes, and my own implementation of the algorithms, like scan, reduce, etc. Also, every single operation would require its own custom kernel. In general it is clear to me how to do this, but after having used thrust for the last couple of days I have the impression that there might be a smarter way
option 2: generate an array of keys from the offsets ({0,0,0,1,1,1,1,1,2,2,3,...} in the above example) and use thrust::reduce_by_key. I don't like the extra list generation, though.
option 3: Use thrust::transform_iterator together with thrust::counting_iterator to generate the above key list on the fly. Unfortunately, I can't come up with an implementation that doesn't require increments of indices into the offset list on the device, which defeats parallelism.
What would be the most sane way to implement this?
Within Thrust, I can't think of a better solution than Option 2. The performance will not be terrible, but it's certainly not optimal.
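For reference, a minimal sketch of Option 2, using the example data from the question. The key array is built entirely on the device with a scatter plus an inclusive scan, so no host-side loop is needed (variable names are illustrative):

#include <cstddef>
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/scatter.h>

int main()
{
    int h_arr[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    int h_off[] = {0, 3, 8}; // offsets of the three buckets
    thrust::device_vector<int> arr(h_arr, h_arr + 10);
    thrust::device_vector<int> offsets(h_off, h_off + 3);

    // Build the key array {0,0,0,1,1,1,1,1,2,2}: scatter a 1 to the start
    // of every bucket except the first, then prefix-sum the flags.
    thrust::device_vector<int> keys(arr.size(), 0);
    thrust::scatter(thrust::make_constant_iterator(1),
                    thrust::make_constant_iterator(1) + (offsets.size() - 1),
                    offsets.begin() + 1, keys.begin());
    thrust::inclusive_scan(keys.begin(), keys.end(), keys.begin());

    // One reduction per bucket.
    thrust::device_vector<int> bucket(offsets.size()), sums(offsets.size());
    thrust::reduce_by_key(keys.begin(), keys.end(), arr.begin(),
                          bucket.begin(), sums.begin());
    for (std::size_t i = 0; i < sums.size(); ++i)
        std::cout << sums[i] << ' '; // prints: 3 25 17
}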
Your data structure bears similarity to the Compressed Sparse Row (CSR) format for storing sparse matrices, so you could use techniques developed for computing sparse matrix-vector multiplies (SpMV) for such matrices if you want better performance. Note that the "offsets" array of the CSR format has length (N+1) for a matrix with N rows (i.e. buckets in your case) where the last offset value is the length of arr. The CSR SpMV code in Cusp is a bit convoluted, but it serves as a good starting point for your kernel. Simply remove any reference to Aj or x from the code and pass offsets and arr into the Ap and Av arguments respectively.
You didn't mention how big the buckets are. If the buckets are big enough, maybe you can get away with copying the offsets and sublist_lengths to the host, iterating over them and doing a separate Thrust call for each bucket. Fermi can have 16 kernels in flight at the same time, so on that architecture you might be able to handle smaller buckets and still get good utilization.