Why is Minimax with Alpha-Beta pruning expanding more nodes on the second turn of Connect Four?

I have a program which plays Connect Four against a human opponent using either the standard Minimax algorithm or Minimax with alpha-beta pruning. Both algorithms have a depth limit, after which they apply an evaluation function. As part of the project I had to create performance curves showing the number of search-tree nodes expanded by each algorithm per computer turn. The curves show a downward trend, as expected, since the number of possible states goes down as the game progresses. However, I can't explain why the number of nodes increases on the second turn of the computer, most prominently in the alpha-beta case, as can be seen in the images below:
These curves were generated from a test game in which the human played first, with a depth limit of 8 plies. Does anyone know why the curves are not strictly decreasing?

Minimax can see an increase in the number of expanded nodes depending on how the game branches at different plies. Consider a contrived game where, for the first 8 plies, both players are forced to "do nothing", and on the 9th ply they suddenly get many more options. With a depth limit of 8 plies, you'll certainly see more nodes expanded once the search is allowed to reach the 9th ply.
The same applies to alpha-beta pruning. On top of that, with alpha-beta pruning the order in which nodes are evaluated affects how many nodes are expanded, so the counts are less predictable from turn to turn.
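If you want to measure this on your own games, a rough sketch of alpha-beta with a node counter (untested; `legal_moves`, `apply`, `is_terminal` and `evaluate` are stand-ins for your own Connect Four code) could look like:

```python
# Minimal alpha-beta sketch with a node counter. The game-specific helpers
# legal_moves(state), apply(state, move), is_terminal(state) and
# evaluate(state) are hypothetical stand-ins for your own implementation.
import math

node_count = 0

def alphabeta(state, depth, alpha, beta, maximizing):
    global node_count
    node_count += 1                      # count every expanded node
    if depth == 0 or is_terminal(state):
        return evaluate(state)
    if maximizing:
        value = -math.inf
        for move in legal_moves(state):  # move ordering changes pruning!
            value = max(value, alphabeta(apply(state, move), depth - 1,
                                         alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:            # beta cutoff: remaining moves pruned
                break
        return value
    else:
        value = math.inf
        for move in legal_moves(state):
            value = min(value, alphabeta(apply(state, move), depth - 1,
                                         alpha, beta, True))
            beta = min(beta, value)
            if alpha >= beta:            # alpha cutoff
                break
        return value
```

Logging `node_count` per computer turn, and experimenting with different move orderings, makes it easy to see how much of the per-turn variation comes from pruning rather than from the raw branching factor.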

Related

How to make an alpha-beta search-based game engine non-deterministic?

I successfully implemented a negascout game engine, which works well, but deterministically. That means I can replay the same game over and over again, because for a given position the engine yields the same best move every time. This is unwanted in my case, because I want to enter my algorithm in coding tournaments, and with deterministic behavior an opponent can easily write a program that wins simply by replaying a sequence of winning moves against mine.
My question is, what is the most efficient and elegant way to make it less deterministic? I could add a random offset to my position evaluation, but I'm afraid this could worsen the evaluation quality. Is there a standard way to do this?
Just start from another random open position. Don't add randomness to your engine until you've worked out the bugs. If two or more moves score equally, you can randomise among those in the move ordering.
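A small sketch of that last idea (only a sketch; `score_move` is a stand-in for whatever your root search returns per move):

```python
import random

def pick_root_move(moves, score_move):
    # score_move(move) is a hypothetical wrapper around the root search.
    scored = [(score_move(m), m) for m in moves]
    best = max(score for score, _ in scored)
    # Randomize only among moves that share the best score, so the
    # evaluation quality itself is untouched.
    best_moves = [m for score, m in scored if score == best]
    return random.choice(best_moves)
```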

Best graph algorithm for least transfer in an electric grid

I'm given a series of cities, and each one produces an amount of electricity and needs an amount of electricity. Each city has up to 8 adjacent cities, and I am trying to minimize the number of transfers.
If A sends B 10 units of energy (A->B), the total transfer cost is 10.
If A sends C 10 units of energy through B (A->B->C), the total transfer cost is 20.
I thought about using Dijkstra's on each point that needs energy, ending the search for that point once enough energy has been found, but I thought of several pitfalls.
I was wondering what else I could consider that could potentially work?
I also considered looking into the Floyd-Warshall algorithm as well as Hagerup's algorithm (I read a bit about them on Wikipedia and they seemed potentially viable).
Thanks
Your problem is easily reduced to a well-known minimum-cost flow problem:
The minimum-cost flow problem (MCFP) is to find the cheapest possible
way of sending a certain amount of flow through a flow network.
This reduction can be done in the following way. Add dummy "source" and "sink" vertices to your graph, add a directed edge from the source to each original vertex with capacity equal to the production rate at that vertex, and add a directed edge from each original vertex to the sink with capacity equal to the consumption rate at that vertex. Set capacities and costs on your original edges as you need them, and solve the max-flow min-cost problem on the resulting network.
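As a sketch of that reduction with networkx (the city names, production/consumption numbers and the unit cost per edge below are made up for illustration):

```python
import networkx as nx

# Toy data (hypothetical): production and consumption per city, plus adjacency.
production = {"A": 1, "B": 1, "C": 0, "D": 0}
consumption = {"A": 0, "B": 0, "C": 1, "D": 1}
adjacency = [("A", "C"), ("B", "C"), ("C", "D")]

G = nx.DiGraph()
for u, v in adjacency:
    # Moving one unit along one edge costs one "transfer", in either direction.
    G.add_edge(u, v, weight=1)
    G.add_edge(v, u, weight=1)

# Dummy source feeds every producer; every consumer drains into a dummy sink.
for city, p in production.items():
    if p > 0:
        G.add_edge("source", city, capacity=p, weight=0)
for city, c in consumption.items():
    if c > 0:
        G.add_edge(city, "sink", capacity=c, weight=0)

total = sum(production.values())
G.nodes["source"]["demand"] = -total   # negative demand = supply
G.nodes["sink"]["demand"] = total

flow = nx.min_cost_flow(G)             # dict of dicts: flow per edge
print(nx.cost_of_flow(G, flow), flow)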
I also doubt that Dijkstra's algorithm or any shortest-path algorithm will be of any use, as they are concerned with the path of only one unit of electricity from a particular city, and do not take into account "interference" effects from electricity produced in different cities. For example, if you have two cities (A and B) producing 1 unit of energy, one more city (C) close to both A and B consuming 1 unit of energy, and one more city (D) far away consuming 1 unit of energy, then you will have to route energy from either A or B to D, but no shortest-path algorithm will give you this.
Ending the search as soon as you have enough energy isn't guaranteed to find the shortest path, but letting Dijkstra run completely for each point that's a power consumer will, and is probably still reasonable to do computationally depending on the size of the network.
Look up the A* algorithm; it improves on Dijkstra with heuristics, which might remove some of the pitfalls.
I can't really think of any other algorithm.
Actually I think A* should be fine.

Alpha-Beta "breaking" Amdahl's law?

I have a classic minimax problem solver with additional alpha-beta pruning implementation.
I parallelized the algorithm in the following way:
Do iterative deepening until we have more nodes than available threads
Run one minimax per thread in batches of N threads. So if the serial search yields 9 possible moves at depth 2, we first start 4 threads, then another 4, and then 1 at the end, each starting at depth 2 with its own parameters.
It turns out that the speedup S = T(serial) / T(parallel) for 4 threads is 4.77, so I am basically breaking Amdahl's law here.
If we say that implementation is not broken in some way, I suspect Alpha-Beta pruning is doing the magic here? Due to starting several searches in parallel, there is more pruning and sooner? That is my theory but I'd love if someone could confirm this in more detail.
Just to clarify:
Minimax without alpha-beta is basically doing a depth-first search of the whole tree up to some max depth.
With alpha-beta it does the same, except it prunes branches that would lead to a worse result anyway.
Edit: After further examination of the code, I found a bug on one line which caused the program to "cheat" and not follow some moves. The actual speedup factor is 3.6 now. Sorry for wasting everyone's time... no breakthrough in computing today. :/
This can be due to cache effects or similar. It is called superlinear speedup. It can and does happen.
Using more threads you are effectively running a partial breadth-first search. It just happens that your problem is amenable to breadth-first search.
Even on a single-core machine you would see a speedup.
You don't need threads to achieve this speedup. You can simply program a (partial) breadth-first search that behaves like multiple threads would.
Imagine you want to search two lists:
1 million times 0, then 1
1, then 1 million times 0
And you stop as soon as you find a 1. If you search them sequentially you need to look at 1,000,001 elements. If you use two threads on a single core, the search finds a 1 after only two elements and you're done. A "superlinear" speedup of around 500,000x!
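Here is a tiny sketch of that toy example, emulating the two "threads" by interleaving the scans on a single core:

```python
# Two lists from the example: a 1 hidden at the end of one list and at the
# start of the other.
a = [0] * 1_000_000 + [1]
b = [1] + [0] * 1_000_000

def sequential(lists):
    steps = 0
    for lst in lists:
        for x in lst:
            steps += 1
            if x == 1:
                return steps
    return steps

def interleaved(lists):
    # Emulate "two threads on one core" by alternating one element at a time.
    steps = 0
    for i in range(max(len(lst) for lst in lists)):
        for lst in lists:
            if i < len(lst):
                steps += 1
                if lst[i] == 1:
                    return steps
    return steps

print(sequential([a, b]))   # 1,000,001 elements examined
print(interleaved([a, b]))  # 2 elements examined
```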

Neural Networks training on multiple cores

Straight to the facts.
My neural network is a classic feedforward network trained with backpropagation.
I have a historical dataset that consists of:
time, temperature, humidity, pressure
I need to predict the next values based on the historical data.
This dataset is about 10MB, so training it on one core takes ages. I want to go multicore with the training, but I can't understand what happens with the training data for each core, and what exactly happens after the cores finish working.
According to: http://en.wikipedia.org/wiki/Backpropagation#Multithreaded_Backpropagation
The training data is broken up into equally large batches for each of
the threads. Each thread executes the forward and backward
propagations. The weight and threshold deltas are summed for each of
the threads. At the end of each iteration all threads must pause
briefly for the weight and threshold deltas to be summed and applied
to the neural network.
'Each thread executes the forward and backward propagations' - this means each thread just trains itself on its part of the dataset, right? How many iterations of training per core?
'At the end of each iteration all threads must pause briefly for the weight and threshold deltas to be summed and applied to the neural network' - What exactly does that mean? When the cores finish training on their datasets, what does the main program do?
Thanks for any input into this!
Complete training by backpropagation is often not the thing one is really looking for, the reason being overfitting. In order to obtain a better generalization performance, approaches such as weight decay or early stopping are commonly used.
On this background, consider the following heuristic approach: split the data into parts corresponding to the number of cores and set up a network for each core (each having the same topology). Train each network completely separately from the others (I would use some common parameters for the learning rate, etc.). You end up with N trained networks f_1(x), ..., f_N(x).
Next, you need a scheme to combine the results. Choose F(x) = sum_{i=1}^N alpha_i * f_i(x), then use least squares to fit the parameters alpha_i such that sum_{j=1}^M (F(x_j) - y_j)^2 is minimized over the M training points. This involves a singular value decomposition, which scales linearly in the number of measurements M and thus should be feasible on a single core. Note that this heuristic approach also bears some similarities to the Extreme Learning Machine. Alternatively, and more easily, you can simply try to average the weights, see below.
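A small numpy sketch of that combination step (the dummy "networks", validation data and shapes are purely illustrative):

```python
import numpy as np

# Hypothetical stand-ins for N separately trained networks: each maps an
# input vector to a scalar prediction.
rng = np.random.default_rng(0)
networks = [lambda x, w=w: float(np.dot(w, x)) for w in rng.random((4, 3))]

# Made-up validation data: M inputs and M targets.
X_val = rng.random((50, 3))
y_val = X_val @ np.array([0.2, 0.5, 0.3])

# Stack per-network predictions into an (M, N) matrix.
F = np.column_stack([[net(x) for x in X_val] for net in networks])

# Least squares for the weights alpha_i minimizing sum_j (F(x_j) - y_j)^2,
# where F(x) = sum_i alpha_i * f_i(x).
alpha, *_ = np.linalg.lstsq(F, y_val, rcond=None)

def combined(x):
    return sum(a * net(x) for a, net in zip(alpha, networks))
```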
Moreover, see these answers here.
Regarding your questions:
As Kris noted, it will usually be one iteration. However, in general it can also be a small number of your choosing. I would play around with choices roughly between 1 and 20 here. Note that the suggestion above uses infinity, so to speak, but then replaces the recombination step by something more appropriate.
This step simply does what it says: it sums up all weights and deltas (what exactly depends on your algorithm). Remember, what you aim for is a single trained network in the end, and the split data is used to estimate it.
To combine, one often does the following:
(i) In each thread, use your current (global) network weights for estimating the deltas by backpropagation. Then calculate new weights using these deltas.
(ii) Average these thread-local weights to obtain new global weights (alternatively, you can sum up the deltas, but this works only for a single bp iteration in the threads). Now start again with (i) in which you use the same newly calculated weights in each thread. Do this until you reach convergence.
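A hedged numpy sketch of steps (i) and (ii) with a thread pool (the per-thread backpropagation is replaced by a dummy delta; plug in your own gradient computation):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def local_update(global_weights, chunk):
    # Placeholder for step (i): backpropagate on this data chunk, starting
    # from the current global weights, and return new thread-local weights.
    deltas = -0.01 * np.random.randn(*global_weights.shape)  # dummy deltas
    return global_weights + deltas

def training_round(global_weights, chunks, n_threads=4):
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        local = list(pool.map(lambda c: local_update(global_weights, c), chunks))
    # Step (ii): average the thread-local weights into new global weights.
    return np.mean(local, axis=0)

# Usage sketch: repeat rounds on the same (or re-shuffled) chunks until the
# weights stop changing.
weights = np.zeros((10, 4))
chunks = [None] * 4          # stand-ins for the per-thread data splits
for _ in range(100):
    weights = training_round(weights, chunks)
```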
This is a form of iterative optimization. Variations of this algorithm:
Instead of always using the same split, use random splits at each iteration step (... or at every n-th iteration). Or, in the spirit of random forests, use only a subset.
Play around with the number of iterations in a single thread (as mentioned in point 1. above).
Rather than summing up the weights, use more advanced forms of recombination (maybe a weighting with respect to the thread-internal training-error, or some kind of least squares as above).
... plus many more choices as in each complex optimization ...
For multicore parallelization it makes no sense to think about splitting the training data over threads etc. If you implement that stuff on your own you will most likely end up with a parallelized implementation that is slower than the sequential implementation because you copy your data too often.
By the way, in the current state of the art, people usually use mini-batch stochastic gradient descent for optimization. The reason is that you can simply forward propagate and backpropagate mini-batches of samples in parallel but batch gradient descent is usually much slower than stochastic gradient descent.
So how do you parallelize the forward propagation and backpropagation? You don't have to create threads manually! You can simply write down the forward propagation with matrix operations and use a parallelized linear algebra library (e.g. Eigen) or you can do the parallelization with OpenMP in C++ (see e.g. OpenANN).
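The matrix formulation itself is language-independent; as a hedged numpy sketch (illustrative shapes and names), a mini-batch forward and backward pass for one hidden layer is just a handful of matrix products, which a parallel BLAS, Eigen, or a GPU backend can then parallelize for you:

```python
import numpy as np

def forward_backward(W1, W2, X_batch, y_batch):
    # Forward pass for a whole mini-batch at once: each row of X_batch is a
    # sample, so one matrix product covers all samples together.
    H = np.tanh(X_batch @ W1)          # hidden activations, shape (B, h)
    y_hat = H @ W2                     # outputs, shape (B, 1)

    # Backward pass (squared error), again as batched matrix products.
    err = y_hat - y_batch              # (B, 1)
    dW2 = H.T @ err / len(X_batch)
    dH = (err @ W2.T) * (1 - H ** 2)   # tanh derivative
    dW1 = X_batch.T @ dH / len(X_batch)
    return dW1, dW2

# Hypothetical data: batch of 32 samples, 4 inputs, 8 hidden units.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 4)), rng.normal(size=(32, 1))
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 1))
dW1, dW2 = forward_backward(W1, W2, X, y)
```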
Today, leading edge libraries for ANNs don't do multicore parallelization (see here for a list). You can use GPUs to parallelize matrix operations (e.g. with CUDA) which is orders of magnitude faster.

How could a minimax algorithm be more optimistic?

Minimax seems to do a great job of not losing, but it's very fatalistic in assuming the opponent won't make a mistake. Of course a lot of games are solved to a draw, but one should be playing for "Push as hard as possible for a win without risking losing", even when no forced wins are available.
That is, given two trees with the same (drawn) end position given optimal play, how could the algorithm be adjusted to prefer the one which is most likely to win if the opponent makes a sub-optimal move, or make the opponent more likely to slip up?
Using the simple example of Tic-Tac-Toe, a stronger player would often aim to set up forks and thereby guarantee wins. Even though the opponent could see such a trick coming and stop it beforehand, they're more likely to miss that than if you just put two Xs in an empty row and hope they momentarily forget what game they're playing. Similarly a strong player would tend to start in the centre or perhaps a corner, but in simple minimax there's no reason (since you can still force a draw) not to pick an edge square.
If I understand your question correctly, you're asking how to adjust the minimax algorithm so that it doesn't assume the opponent always makes the best move.
Look into the expectiminimax algorithm, a variation of minimax.
Essentially, instead of dealing with only min or max nodes, it introduces chance nodes which store a probability that the opponent will choose the current move.
To make it even simpler, you could assume the opponent selects each move (node) with equal probability.
In short, when it's the opponent's turn, instead of returning the minimum score, return the average score of their possible moves.
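A minimal sketch of that equal-probability variant (untested; the game helpers `is_terminal`, `evaluate`, `legal_moves` and `apply` are stand-ins for your own code):

```python
def expectiminimax(state, depth, maximizing):
    # Hypothetical helpers: is_terminal, evaluate, legal_moves, apply.
    if depth == 0 or is_terminal(state):
        return evaluate(state)
    children = [apply(state, m) for m in legal_moves(state)]
    values = [expectiminimax(c, depth - 1, not maximizing) for c in children]
    if maximizing:
        return max(values)
    # The opponent's turn is modelled as a chance node with uniform
    # probabilities: return the average instead of the minimum.
    return sum(values) / len(values)
```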
How about tweaking "min" nodes?
In regular minimax, when evaluating a position for the opponent, the score is the minimum score over each of their moves. Injecting some optimism (from the "max" player's point of view) into the search could be done by using a different function than the minimum.
Some things that could be tried out:
- using the second-worst score
- using a mix of the min and the average (or median)
Perhaps this should be tied to an optimism factor that increases with the depth of the node. This would avoid ignoring a very bad move by the opponent near the root of the tree (which in most games would mean a more obvious move). A rough sketch of both ideas follows below.
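Here is that sketch, with an optimism factor that grows with depth (the names and the particular blending scheme are just one possible choice):

```python
def optimistic_min(scores, depth, optimism_per_ply=0.05):
    # Blend the true minimum with the average of the opponent's replies.
    # Deeper nodes get a larger optimism weight, so obvious refutations near
    # the root are still taken seriously.
    worst = min(scores)
    avg = sum(scores) / len(scores)
    w = min(1.0, optimism_per_ply * depth)
    return (1 - w) * worst + w * avg

def second_worst(scores):
    # Alternative: assume the opponent misses the single best refutation.
    return sorted(scores)[1] if len(scores) > 1 else scores[0]
```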