Something wrong with BFS maze solving algorithm in OCaml
The above link contains a program I wrote to solve mazes using a BFS algorithm. The maze is represented as a 2D array, initially passed in as numbers (0 represents an empty block that can be visited; any other number represents a "wall" block), and then converted into a record type I defined, which keeps track of various data:
type mazeBlock = {
  walkable : bool;
  isFinish : bool;
  visited : bool;
  prevCoordinate : int * int
}
The output is a list of ordered pairs (coordinates/indices) which trace a shortest path through the maze from the start to the finish, the coordinates of which are both passed in as parameters.
It works fine for smaller mazes with a low branching factor, but when I test it on larger mazes (say 16 x 16 or larger), especially ones with no walls (high branching factor), it takes up a LOT of time and memory. I am wondering if this is inherent to the algorithm or related to the way I implemented it. Can any OCaml hackers out there offer me their expertise?
Also, I have very little experience with OCaml so any advice on how to improve the code stylistically would be greatly appreciated. Thanks!
Here is a cleaned-up, edited version of the program. I fixed some stylistic issues, but I didn't change the semantics. As before, the second test still takes up a huge amount of resources and does not seem to finish at all. Still seeking help on this issue...
SOLVED. Thanks so much to both answerers. Here is the final code:

In your critical section, that is mazeSolverLoop, you should only visit elements that have not been visited before. When you take an element from the queue, you should first check whether it has already been visited, and in that case do nothing but recurse to get the next element. This is precisely what gives the algorithm its good time complexity (you never visit a place twice).
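To make the point about visiting each cell only once concrete, here is a minimal sketch, written in Python rather than OCaml purely for illustration. The names solve_maze and prev are hypothetical and not taken from the original program. Note one difference from the advice above: this sketch marks a cell as visited when it is enqueued rather than when it is dequeued, which has the same effect of never processing a cell twice and also keeps the queue small.

```python
from collections import deque

def solve_maze(grid, start, finish):
    """Return a shortest path from start to finish as a list of (row, col)
    pairs, or None if there is none. grid[r][c] == 0 means walkable."""
    rows, cols = len(grid), len(grid[0])
    prev = {start: None}          # doubles as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == finish:
            # Walk the predecessor links back to the start.
            path = []
            while cell is not None:
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and nxt not in prev):
                prev[nxt] = cell  # mark as visited before enqueueing
                queue.append(nxt)
    return None
```

With this check in place, each cell enters the queue at most once, so a 16 x 16 grid with no walls involves only 256 queue operations.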
Otherwise, yes, your OCaml style could be improved. Some remarks:
the convention in OCaml-land is rather to write_like_this instead of writeLikeThis. I recommend that you follow it, but admittedly that is a matter of taste and not an objective criterion.
there is no point in returning a datastructure if it is a mutable structure that was updated; why do you make a point to always return a (grid, pair) queue, when it is exactly the same as the input? You could just have those functions return unit and have code that is simpler and easier to read.
the abstraction level allowed by pairs is good and you should preserve it; you currently don't. There is no point in writing for example, let (foo, bar) = dimension grid in if in_bounds pos (foo, bar). Just name the dimension dim instead of (foo, bar), it makes no sense to split it in two components if you don't need them separately. Remark that for the neighbor, you do use neighborX and neighborY for array access for now, but that is a style mistake: you should have auxiliary functions to get and set values in an array, taking a pair as input, so that you don't have to destruct the pair in the main function. Try to keep all the code inside a single function at the same level of abstraction: all working on separate coordinates, or all working on pairs (named as such instead of being constructed/deconstructed all the time).

If I understand you right, for an N x N grid with no walls you have a graph with N^2 nodes and roughly 4*N^2 edges. These don't seem like big numbers for N = 16.
I'd say the only trick is to make sure you track visited nodes properly. I skimmed your code and don't see anything obviously wrong in the way you're doing it.
Here is a good OCaml idiom. Your code says:
let isFinish1 = mazeGrid.(currentX).(currentY).isFinish in
let prevCoordinate1 = mazeGrid.(currentX).(currentY).prevCoordinate in
mazeGrid.(currentX).(currentY) <-
{ walkable = true;
isFinish = isFinish1;
visited = true;
prevCoordinate = prevCoordinate1}
You can say this a little more economically as follows:
mazeGrid.(currentX).(currentY) <-
{ mazeGrid.(currentX).(currentY) with visited = true }


How can I remove too close points in a list

I have a list of points with x,y coordinates:
List_coord=[(462, 435), (491, 953), (617, 285),(657, 378)]
The length of this list (4 elements here) can be very large, from a few hundred up to 35000 elements.
I want to remove points that are too close to each other, based on a threshold.
Note: points are never at exactly the same position.
My current code for that:
iteration = 0
while iteration < 5:
    for pt in List_coord:
        for PT in List_coord:
            if (abs(pt[0]-PT[0])+abs(pt[1]-PT[1]))!=0 and abs(pt[0]-PT[0])<threshold and abs(pt[1]-PT[1])<threshold:
                List_coord.remove(PT)  # drop the point that is too close
    iteration += 1
Explanation of my terrible code :) :
I check whether the distance is 0, which means that I am comparing a point with itself,
then I check the distance in x and in y.
I need a few iterations to avoid missing a removal, because the list changes inside the loop itself.
This code works, but it is a very slow process!
I am sure there is another, much easier method, but I wasn't able to find one, even though some already-answered questions are close to mine.
Note: I would like to avoid using an extra library for this code if possible.
Python will be a bit slow at this ;-)
The solution you will probably want is called quad-trees, but I'll mention a simpler approach first, in case it's preferable.
The usual approach is to group the points so that you can easily reject points that are clearly far away from each other.
One approach might be to sort the list twice, once by x once by y. You can prove that if two points are too-close, they must be close in one dimension or the other. Thus your inner loop can break out early. If it sees a point that is too far away from the outer point in the sorted direction, it can know for a fact that all future points in that list are also too far away. Thus it doesn't have to look any further. Do this in X and Y and you're set!
This approach is going to tend to be dominated by the O(n log n) sort times. However, if all of your points share a single x value, you'll end up doing the same slow O(n^2) iteration that you're doing right now because you never terminate the inner loop early.
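As a rough sketch of the sorting idea, using only the standard library: since the asker's criterion requires points to be close in both x and y, sorting by x alone is enough to stop the inner loop early. The function name remove_close_points is hypothetical, and the choice of which point of a too-close pair survives (the one with the smaller x) is an assumption, since the question does not say.

```python
def remove_close_points(points, threshold):
    """Greedily keep points so that no two kept points are closer than
    `threshold` in both x and y (the criterion used in the question)."""
    kept = []  # kept points, in increasing x order
    for pt in sorted(points):
        too_close = False
        # Scan kept points from the largest x downwards.
        for other in reversed(kept):
            if pt[0] - other[0] >= threshold:
                break  # every earlier kept point is even further away in x
            if abs(pt[1] - other[1]) < threshold:
                too_close = True
                break
        if not too_close:
            kept.append(pt)
    return kept
```

For example, remove_close_points([(462, 435), (491, 953), (617, 285), (657, 378)], 100) drops (657, 378), which is within 100 of (617, 285) on both axes. As noted above, the worst case (all points sharing one x value) is still quadratic.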
The more robust solution is to use quadtrees. Quadtrees are designed to solve the kind of problem you are looking at. The idea is to build a tree such that you can rapidly exclude large numbers of points. I'd recommend this.
If your number of points gets too large, I'd recommend getting a clustering library. Efficient clustering is a very difficult task, and often done in C++ or another fast language.

How to efficiently add and remove vector values C++

I am trying to efficiently calculate the averages from a vector.
I have a matrix (vector of vectors) where:
row: the days I am going back (250)
column: the types of things I am calculating the average of (10,000 different things)
Currently I am using .push_back() which essentially iterates through each row in each column and then I use erase() in order to remove the last value. As this method goes through all the values, my code is very slow.
I am thinking of a method linked to substitution, however I have a hard time implementing the idea, as all the values have an order (i.e. I need to remove the old value and the value I add / substitute will be the newest).
Below is my code so far.
Any ideas for a solution or guides for the right direction will be much appreciated.
vector<vector<float> > vectorOne(250, vector<float>(10000, 0));
//This is the slow method
vectorOne[column].push_back(1); //add newest value
vectorOne[column].erase(vectorOne[column].begin()); //remove oldest value
You probably need a different data structure.
The problem sounds like a queue. You add to the end and take from the front. With real queues, everyone then shuffles up a step. With computer queues, we can use a circular buffer (you do need to be able to get a reasonable bound on maximum queue length).
I suggest implementing your own on top of a plain C array first, then using the STL version when you've understood the principle.
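To show the principle, here is a minimal circular-buffer sketch with a running sum, written in Python only for brevity; the same indexing carries over directly to a plain C array. The class name RollingMean and its methods are hypothetical. It assumes one such buffer per "thing" with a fixed window (250 days in the question).

```python
class RollingMean:
    """Fixed-size circular buffer that keeps a running sum, so adding a
    value and reading the average are both O(1)."""

    def __init__(self, window):
        self.buf = [0.0] * window
        self.pos = 0       # slot that will be overwritten next (the oldest)
        self.count = 0     # number of valid values stored so far
        self.total = 0.0

    def push(self, value):
        if self.count == len(self.buf):
            self.total -= self.buf[self.pos]  # forget the oldest value
        else:
            self.count += 1
        self.buf[self.pos] = value            # overwrite in place, no shifting
        self.total += value
        self.pos = (self.pos + 1) % len(self.buf)

    def mean(self):
        return self.total / self.count if self.count else 0.0
```

Nothing is ever erased or shifted: the newest value simply overwrites the oldest slot. With floating-point data the running sum can drift slightly over many updates, so recomputing it from the buffer occasionally may be worthwhile.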

Remove duplicates algorithm

I'm trying to write an algorithm to remove duplicates from a vector<struct xxxx*>.
struct xxxx{
    int value; // This is just to make you understand
    xxxx* one;
    xxxx* two;
};
As you can see, my struct is like a tree, but the pointers are not in order. The pointers can point to any (actually not any, but most) of the others. And the vector doesn't contain the structs but pointers, so I couldn't use the std algorithms to help me either.
I'm trying to delete duplicates with exactly the same value and the same two pointers, but at the same time, if I have two identical structs (let's say A and B) and C.one or C.two points to B, then I need to change it to point to A, and vice versa.
In other words: if A == B, then remove B and change whatever pointed to B to point to A.
I think I can write the brute-force, so if there's no better algorithm I'll write it by myself.
Yesterday, I tried to explain the reasonable approach to a very similar problem to a coworker who had used an N squared solution to an N log N problem.
First create a helper struct, that is basically a wrapper around an xxxx* with a comparison operator checking the contents (not the pointer value) and probably with some other utility functions. This wrapper struct isn't strictly needed vs. just using xxxx*, but from experience, I think it makes the task cleaner.
Create a std::set of those helper structs, into which you will only insert unique elements, and likely another set into which you will insert recursively unresolved elements.
Loop through the original vector and at each position recurse through its children. If you hit a child already in the unique set, that is a final value for that child pointer. If you hit a child that matches a unique element without being the one it matches, then fix the pointer that got you there. If there is also the possibility of null pointers that should bottom the recursion, and if loops are possible you need to detect them (with that recursively unresolved set) and some decision about what to do with a loop. At some point you hit resolved unique elements and add that to the unique set.
The performance and maybe even soundness of the idea depends on the depth and complexity of the loops and what you want to do with loops. There are some messy cases where a loop would map onto another loop, but detecting that could be very tricky. If your phrase "like a tree" meant "no loops", then the recursion bottoms out cleanly and efficiently without the extra complexity of explicitly managing the recursively unresolved elements.
Obviously I left out some of the grunt work detail around detecting unique / non-unique as you back out of the recursion and around detecting "already did it during an earlier recursion" as you hit an item in the main loop above the recursion. But all those details should be pretty obvious as you write the relevant parts of the code.
Edit: To understand how few node visits there are despite nesting a recursion inside a sequential loop, think from the point of view of the pointers. We follow each pointer at most once (some duplicates are pre detected without following their pointers). For N nodes, there are N top level pointers (if I understood your description correctly) and significantly less than 2N internal pointers (the more tree-like it is, the closer it will be to N-1 internal pointers, rather than 2N). So each node is visited on average less than 3 times and a minority of those visits require both the pre lookup and the post recursion lookup, and each lookup is log U where U is the number of unique items found up to that point. So we can trivially see a bound of 6 N log N.
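The loop-free case can be sketched as follows, in Python for brevity. This is only an illustration under stated assumptions: there are no cycles, a dict keyed on contents stands in for the std::set of wrapper structs described above, and the names Node, dedupe and canon are hypothetical.

```python
class Node:
    def __init__(self, value, one=None, two=None):
        self.value = value
        self.one = one
        self.two = two

def dedupe(nodes):
    """Return the unique nodes, redirecting child pointers to the surviving
    copy. Assumes the structure has no cycles."""
    unique = {}    # (value, id(one), id(two)) -> canonical node
    resolved = {}  # id(node) -> canonical node, so each node is handled once

    def canon(node):
        if node is None:
            return None
        if id(node) in resolved:
            return resolved[id(node)]
        # Resolve the children first so that equal subtrees share pointers.
        node.one = canon(node.one)
        node.two = canon(node.two)
        key = (node.value, id(node.one), id(node.two))
        rep = unique.setdefault(key, node)  # first node seen with this key wins
        resolved[id(node)] = rep
        return rep

    result, seen = [], set()
    for node in nodes:
        rep = canon(node)
        if id(rep) not in seen:
            seen.add(id(rep))
            result.append(rep)
    return result
```

Each pointer is followed at most once thanks to the resolved table. Very deep structures would hit Python's recursion limit; that is a limitation of this sketch, not of the approach.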

Hard sorting problem - what type of algorithm should I be using?

The problem:
N nodes are related to each other by a 'closeness' factor ranging from 0 to 1, where a factor of 1 means that the two nodes have nothing in common and 0 means the two nodes are exactly alike.
If two nodes are both close to another node (i.e. they have a factor close to 0) then this doesn't mean that they will be close together, although probabilistically they do have a much higher chance of being close together.
The question:
If another node is placed in the set, find the node that it is closest to in the shortest possible amount of time.
This isn't a homework question, this is a real world problem that I need to solve - but I've never taken any algorithm courses etc so I don't have a clue what sort of algorithm I should be researching.
I can index all of the nodes before another one is added and gather closeness data between each node, but short of comparing all nodes to the new node I haven't been able to come up with an efficient solution. Any ideas or help would be much appreciated :)
If your 'closeness' metric obeys the triangle inequality, you should be able to use a variant of BK-Trees to organize your elements. Adapting them to real numbers should simply be a matter of choosing an interval to quantize your number on, and otherwise using the standard BK-Tree procedure. Some experimentation may be required - you might want to increase the resolution of the quantization as you progress down the tree, for instance.
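A minimal sketch of a BK-tree with quantized real-valued distances might look like the following. It is only valid under the assumption that the closeness function really does satisfy the triangle inequality. The class name BKTree, the step parameter and its default are hypothetical choices.

```python
class BKTree:
    """BK-tree over a real-valued metric, bucketed into intervals of `step`."""

    def __init__(self, dist, step=0.05):
        self.dist = dist
        self.step = step
        self.root = None  # each tree node is [item, {bucket: child node}]

    def add(self, item):
        if self.root is None:
            self.root = [item, {}]
            return
        node = self.root
        while True:
            bucket = int(self.dist(item, node[0]) / self.step)
            child = node[1].get(bucket)
            if child is None:
                node[1][bucket] = [item, {}]
                return
            node = child

    def nearest(self, query):
        best_item, best_d = None, float("inf")
        stack = [self.root] if self.root else []
        while stack:
            item, children = stack.pop()
            d = self.dist(query, item)
            if d < best_d:
                best_item, best_d = item, d
            for bucket, child in children.items():
                lo = bucket * self.step
                hi = lo + self.step
                # Triangle inequality: everything in this subtree lies between
                # lo and hi from `item`, so it is at least `bound` from query.
                bound = lo - d if d < lo else (d - hi if d > hi else 0.0)
                if bound < best_d:
                    stack.append(child)
        return best_item, best_d
```

How much of the tree gets pruned depends heavily on the data and on step, so this may or may not beat a plain linear scan in practice.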
"but short of comparing all nodes to the new node I haven't been able to come up with an efficient solution"
Without any other information about the relationships between nodes, this is the only way you can do it, since you have to figure out the closeness factor between the new node and each existing node. An O(n) algorithm can be a perfectly decent solution.
One addition you might consider - keep in mind we have no idea what data structure you are using for your objects - is to organize all present nodes into a graph, where nodes with factors below a certain threshold can be considered connected, so you can first check nodes that are more likely to be similar/related.
If you want the optimal algorithm in terms of speed, but O(n^2) space, then for each node create a sorted list of other nodes (ordered by closeness).
When you get a new node, you have to add it to the indexed list of all the other nodes, and all the other nodes need to be added to its list.
To find the node closest to a given node, just take the first node on that node's list.
Since you already need O(n^2) space (in order to store all the closeness information you need basically an NxN matrix where A[i,j] represents the closeness between i and j) you might as well sort it and get O(1) retrieval.
If this closeness forms a linear spectrum (such that closeness to something implies closeness to other things that are close to it, and not being close implies not being close to those close), then you can simply do a binary or interpolation sort on insertion for closeness, handling one extra complexity: at each point you have to see if closeness increases or decreases below or above.
For example, if we consider letters - A is close to B but far from Z - then the pre-existing elements can be kept sorted, say: A, B, E, G, K, M, Q, Z. To insert say 'F', you start by comparing with the middle element, [3] G, and the one following that: [4] K. You establish that F is closer to G than K, so the best match is either at G or to the left, and we move halfway into the unexplored region to the left... 3/2=[1] B, followed by E, and we find E's closer to F, so the match is either at E or to its right. Halving the space between our earlier checks at [3] and [1], we test at [2] and find it equally-distant, so insert it in between.
EDIT: in probabilistic situations it may work better, and require fewer comparisons, to start at the ends of the spectrum and work your way in (e.g. compare F to A and Z, decide it's closer to A, then see whether it is closer to A or to the halfway point [3] G). Also, it might be good to finish with a comparison to the closest few points on either side of where the binary/interpolation search led you.
ACM Surveys September 2001 carried two papers that might be relevant, at least for background. "Searching in Metric Spaces", lead author Chavez, and "Searching in High Dimensional Spaces - Index Structures for Improving the Performance of Multimedia Databases", lead author Bohm. From memory, if all you have is the triangle inequality, you can use it to some effect, but if you can trim your data down to a sensible number of dimensions, you can do better by using a search structure that knows about this dimensional structure.
Facebook has this thing where it puts you and all of your friends in a graph, then slowly moves everyone around until people are grouped together based on mutual friends and so on.
It looked to me like they just made anything <0.5 an attractive force, anything >0.5 a repulsive force, and moved people with every iteration based on the net force. After a couple hundred iterations, it was looking pretty darn good.
Note: this is not an algorithm it is a heuristic. In the facebook implementation I saw, two people were not able to reach equilibrium and kept dancing around each other. It turns out they were actually the same person with two different accounts.
Also, it took about 15 minutes on a decent computer and ~100 nodes. YMMV.
It looks suspiciously like a Nearest Neighbor Search problem (also called a similarity search)

how to create a 20000*20000 matrix in C++

I try to calculate a problem with 20000 points, so there is a distance matrix with 20000*20000 elements, how can I store this matrix in C++? I use Visual Studio 2008, on a computer with 4 GB of RAM. Any suggestion will be appreciated.
A sparse matrix may be what you are looking for. Many problems don't have values in every cell of a matrix. SparseLib++ is a library which allows for efficient matrix operations.
Avoid the brute force approach you're contemplating and try to envision a solution that involves populating a single 20000 element list, rather than an array that covers every possible permutation.
For starters, consider the following simplistic approach which you may be able to improve upon, given the specifics of your problem:
int bestResult = -1; // some invalid value
int bestInner = -1;
int bestOuter = -1;
for ( int outer = 0; outer < MAX; outer++ )
{
    for ( int inner = 0; inner < MAX; inner++ )
    {
        int candidateResult = SomeFunction( list[ inner ], list[ outer ] );
        if ( candidateResult > bestResult )
        {
            bestResult = candidateResult;
            bestInner = inner;
            bestOuter = outer;
        }
    }
}
You can represent your matrix as a single large array. Whether it's a good idea to do so is for you to determine.
If you need four bytes per cell, your matrix is only 4*20000*20000, that is, 1.6GB. Any platform should give you that much memory for a single process. Windows gives you 2GiB by default for 32-bit processes -- and you can play with the linker options if you need more. All 32-bit unices I tried gave you more than 2.5GiB.
Is there a reason you need the matrix in memory?
Depending on the complexity of the calculations you need to perform, you could simply use a function that calculates your distances on the fly. This could even be faster than precalculating every single distance value if you only use some of them.
Without more references to the problem at hand (and the use of the matrix), you are going to get a lot of answers... so indulge me.
The classic approach here would be to go with a sparse matrix, however the default value would probably be something like 'not computed', which would require special handling.
Perhaps you could use a caching approach instead.
I assume you want to avoid recomputing the distances over and over, and so you'd like to keep them in this huge matrix. However, note that you can always recompute them. In general, storing values that can be recomputed in exchange for a speed-up is exactly what caching is about.
So I would suggest using a distance class that abstracts the caching for you.
The basic idea is simple:
When you request a distance, either you already computed it, or not
If computed, return it immediately
If not computed, compute it and store it
If the cache is full, delete some elements to make room
The practice is a bit more complicated, of course, especially for efficiency and because of the limited size which requires an algorithm for the selection of those elements etc...
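The four steps above can be sketched as follows, in Python for brevity. This assumes 2D points with Euclidean distance and a least-recently-used eviction policy; both are illustrative assumptions, and the class name DistanceCache is hypothetical.

```python
from collections import OrderedDict
import math

class DistanceCache:
    """Computes distances on demand and remembers the most recently used."""

    def __init__(self, points, capacity):
        self.points = points
        self.capacity = capacity
        self.cache = OrderedDict()  # (i, j) with i <= j -> distance
        self.misses = 0             # how many distances were actually computed

    def distance(self, i, j):
        key = (i, j) if i <= j else (j, i)  # the matrix is symmetric
        if key in self.cache:
            self.cache.move_to_end(key)     # mark as recently used
            return self.cache[key]
        self.misses += 1
        (x1, y1), (x2, y2) = self.points[i], self.points[j]
        d = math.hypot(x2 - x1, y2 - y1)
        self.cache[key] = d
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used
        return d
```

Whether LRU is the right eviction policy depends on the access pattern; for a routing problem where the same neighbors are queried repeatedly, it is a plausible starting point.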
So before we delve in the technical implementation, just tell me if that's what you're looking for.
Your computer should be able to handle 1.6 GB of data (assuming 32bit)
size_t n = 20000;
typedef long dist_type; // 32 bit
std::vector <dist_type> matrix(n*n);
And then use:
dist_type value = matrix[n * y + x];
You can (by using small datatypes), but you probably don't want to.
You are better off using a quad tree (if you need to find the nearest N matches), or a grid of lists (if you want to find all points within R).
In physics, you can just approximate distant points with a field, or a representative amalgamation of points.
There's always a solution. What's your problem?
Man you should avoid the n² problem...
Put your 20 000 points into a voxel grid.
Finding closest pair of points should then be something like n log n.
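As a rough 2D illustration of the grid idea (the same scheme extends to voxels in 3D), the sketch below buckets points into square cells and then looks only at nearby cells when asked for all points within a radius. The function names build_grid and neighbors_within, and the choice of cell size, are hypothetical.

```python
from collections import defaultdict

def build_grid(points, cell):
    """Map each (cell_x, cell_y) to the indices of the points inside it."""
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(int(x // cell), int(y // cell))].append(idx)
    return grid

def neighbors_within(points, grid, cell, i, radius):
    """Indices of all points within `radius` of point i (excluding i)."""
    x, y = points[i]
    cx, cy = int(x // cell), int(y // cell)
    reach = int(radius // cell) + 1  # how many cells the radius can span
    found = []
    for gx in range(cx - reach, cx + reach + 1):
        for gy in range(cy - reach, cy + reach + 1):
            for j in grid.get((gx, gy), ()):
                if j != i:
                    px, py = points[j]
                    if (px - x) ** 2 + (py - y) ** 2 <= radius ** 2:
                        found.append(j)
    return sorted(found)
```

With a cell size close to the typical query radius, each query touches only a handful of cells instead of all 20000 points, which also avoids storing the full distance matrix.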
As stated by other answers, you should try hard to either use sparse matrix or come up with a different algorithm that doesn't need to have all the data at once in the matrix.
If you really need it, maybe a library like stxxl might be useful, since it's specially designed for huge datasets. It handles the swapping for you almost transparently.
Thanks a lot for your answers. What I am doing is solving a vehicle routing problem with about 20000 nodes. I need one matrix for distances and one matrix for a neighbor list (for each node, all other nodes listed in order of distance). This list will be used very often to find candidate nodes. I guess the distance matrix can sometimes be omitted if we can calculate distances when we need them. But the neighbor list is not convenient to create every time. The list data type could be int.
To mgb:
how much can a 64 bit windows system help this situation?