Multidimensional array v. Sorting multiple arrays - c++

I've hit a snag in my C++ program and I'm not sure what the best way to approach the problem is. Here is the situation in non-programming terms: I have a list of children, and each child has a specific weight, age, and happiness. People can view a skeleton of the child whose movement reflects these characteristics. (Think of an MMO character customization screen with a slider for each characteristic: when you slide the weight slider to heavy, the walk cycle looks like the character is heavier.)
Before, my system had one set walk cycle for each end of the spectrum of each characteristic. For example, there was one specific walk cycle for the heaviest walk, one for the lightest walk, one for the youngest walk, etc. There were no intermediate presets: the position of the slider on the scale was used as a percentage to blend the heaviest walk cycle with the lightest walk cycle.
Now to the problem. I have a large library of preset walk cycles, and each walk cycle has a specific weight, age, and happiness. So Joe has a weight of 4, an age of 7, and a happiness level of 8, and Sally has 2, 3, 5. Suppose the sliders are moved to specific values (weight 5, age 8, happiness 7). Only one slider can be moved at a time, and the slider that was moved last is the most important characteristic to match. I want to find the child in my library whose values are closest to all three of these; here Joe would be the closest.
I was told to check out using a 3-dimensional array, but I would rather use an array of child objects and do multiple searches on that array. I am a rookie and I know the search will take a bit of computing time, but I keep leaning towards using the single array. I could also use a two-dimensional array, but I'm not sure. What data structure would be best for searching on three values?
Thank you for any help!

How many different values can each slider take? If there are, say, ten values for each slider, there are 10*10*10=1000 different possible character classes. If your library has fewer than 1000 walk cycles, just reading through them all looking for the nearest match is probably gonna be fast enough.
Of course, if there are 100 values for each slider then you may want something more clever. My point is that some things don't have to be optimized.
Also, is your library of walk cycles fixed once and for all? If so, perhaps you could precompute the walk cycle for each setting of the sliders and write that to a static array.

I agree with Wilf that the number of walk cycles is critical, as even if there are say 100,000 cycles you could easily use a brute-force find-the-minimum over...
weight_factor * diff(candidate.weight, target.weight) +
age_factor * diff(candidate.age, target.age) +
happiness_factor * diff(candidate.happiness, target.happiness)
...where the factor for the last-moved slider was higher than the others.
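A minimal sketch of that brute-force scan, assuming integer characteristics and an absolute difference for diff. The names Cycle, Slider, closestCycle and the boost parameter are hypothetical and not from the question:

```cpp
#include <cstddef>
#include <cstdlib>
#include <limits>
#include <vector>

// Hypothetical record for one preset walk cycle.
struct Cycle { int weight; int age; int happiness; };

enum class Slider { Weight, Age, Happiness };

// Returns the index of the cycle with the smallest weighted difference
// from target, or library.size() if the library is empty.
// The last-moved slider gets factor `boost` (an assumed value), the others 1.
std::size_t closestCycle(const std::vector<Cycle>& library, const Cycle& target,
                         Slider lastMoved, int boost = 3) {
    const int weightFactor = lastMoved == Slider::Weight ? boost : 1;
    const int ageFactor = lastMoved == Slider::Age ? boost : 1;
    const int happinessFactor = lastMoved == Slider::Happiness ? boost : 1;

    std::size_t bestIndex = library.size();
    long bestScore = std::numeric_limits<long>::max();
    for (std::size_t i = 0; i < library.size(); ++i) {
        const Cycle& c = library[i];
        const long score =
            static_cast<long>(weightFactor) * std::abs(c.weight - target.weight) +
            static_cast<long>(ageFactor) * std::abs(c.age - target.age) +
            static_cast<long>(happinessFactor) * std::abs(c.happiness - target.happiness);
        if (score < bestScore) {  // keep the first cycle with the lowest score
            bestScore = score;
            bestIndex = i;
        }
    }
    return bestIndex;
}
```

With the Joe and Sally values from the question and a target of (5, 8, 7), Joe scores lower than Sally whichever slider moved last, so his index is returned.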
For more cycles than that you'd want to limit the search space somewhat, and some indices would be useful, e.g.:
map<int, map<int, map<int, vector<Cycle*>>>> cycles_by_weight_age_happiness;
You'd populate that by adding a pointer to each walk cycle - characterised by { weight, age, happiness } - to cycles_by_weight_age_happiness[rw(weight)][ra(age)][rh(happiness)], where each of rw, ra and rh rounds its parameter to whatever granularity you like (e.g. round weight down to the nearest 5 kg, group ages by the integer part of log base 1.5 of age, leave happiness alone). Then to search you evaluate the entries "around" your target { rw(weight), ra(age), rh(happiness) } indices... the further from there you deviate (especially on the last-slider-moved parameter), the less likely you are to find a better fit than you've already seen, so tune to taste.
The above indexing is a refinement of what I think Wilf intended - just using functions to decouple the mapping from value space into vectors in the index.
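The bucketed index could be sketched roughly as follows. The rounding functions here are placeholder assumptions (fixed-width buckets of 5 for weight and age, non-negative inputs), and buildIndex and candidatesNear are hypothetical helper names:

```cpp
#include <map>
#include <vector>

struct Cycle { int weight; int age; int happiness; };

// Placeholder rounding functions (assumed granularity, non-negative inputs).
inline int rw(int weight) { return weight / 5; }
inline int ra(int age) { return age / 5; }
inline int rh(int happiness) { return happiness; }

using CycleIndex =
    std::map<int, std::map<int, std::map<int, std::vector<const Cycle*>>>>;

// Stores a pointer to every cycle under its rounded coordinates.
// The library must outlive the index, since only pointers are kept.
CycleIndex buildIndex(const std::vector<Cycle>& library) {
    CycleIndex index;
    for (const Cycle& c : library)
        index[rw(c.weight)][ra(c.age)][rh(c.happiness)].push_back(&c);
    return index;
}

// Collects all cycles whose bucket is within `radius` buckets of the
// target's bucket on every axis. Widen the radius if nothing is found.
std::vector<const Cycle*> candidatesNear(const CycleIndex& index,
                                         const Cycle& target, int radius) {
    const int tw = rw(target.weight), ta = ra(target.age), th = rh(target.happiness);
    std::vector<const Cycle*> out;
    for (auto w = index.lower_bound(tw - radius);
         w != index.end() && w->first <= tw + radius; ++w)
        for (auto a = w->second.lower_bound(ta - radius);
             a != w->second.end() && a->first <= ta + radius; ++a)
            for (auto h = a->second.lower_bound(th - radius);
                 h != a->second.end() && h->first <= th + radius; ++h)
                out.insert(out.end(), h->second.begin(), h->second.end());
    return out;
}
```

The candidates returned this way would then be ranked with the weighted difference described earlier; a per-axis radius (smaller on the last-moved slider) would be a natural refinement.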

Related

Binary Snap [AIO 2015]

This is a question from the Australian Informatics Olympiad
The question is:
Have you ever heard of Melodramia, my friend? It is a land of forbidden forests and boundless swamps, of sprinting heroes and dashing heroines. And it is home to two dragons, Rose and Scarlet, who, despite their competitive streak, are the best of friends.
Rose and Scarlet love playing Binary Snap, a game for two players. The game is played with a deck of cards, each with a numeric label from 1 to N. There are two cards with each possible label, making 2N cards in total. The game goes as follows:
Rose shuffles the cards and places them face down in front of Scarlet.
Scarlet then chooses either the top card, or the second-from-top card from the deck and reveals it.
Scarlet continues to do this until the deck is empty. If at any point the card she reveals has the same label as the previous card she revealed, the cards are a Dragon Pair, and whichever dragon shouts `Snap!' first gains a point.
After many millenia of playing, the dragons noticed that having more possible Dragon Pairs would often lead to a more exciting game. It is for this reason they have summoned you, the village computermancer, to write a program that reads in the order of cards in the shuffled deck and outputs the maximum number of Dragon Pairs that the dragons can find.
I'm not sure how to solve this. I thought of an approach, but it is wrong (taking the maximum over all cards of each card compared with its previous occurrence).
Here's my code as of now:
#include <iostream>
#include <fstream>
using namespace std;
int main() {
    ifstream fin("snapin.txt");
    ofstream fout("snapout.txt");
    int n;
    fin>>n;
    int arr[(2*n)+1];
    for(int i=0;i<2*n;i++){
        fin>>arr[i];
    }
    int dp[(2*n) +1];
    int maxi = 0;
    int pos[n+1];
    for(int i=0;i<n+1;i++){
        pos[i] = -1;
    }
    int count = 0;
    for(int i=2;i<(2*n)-2;i++){
        if(pos[arr[i]] == -1){
            pos[arr[i]] = i;
        }else{
            dp[i] = pos[arr[i]]+1;
            maxi = max(dp[i],maxi);
        }
        dp[i] = max(dp[i],maxi);
    }
    fout<<dp[2*n -1];
}
Ok, let's get some basic measurements of the problem out of the way first:
There are 2N cards. 1 card is drawn at a time, without replacement. Therefore there are 2N draws, taking the deck from size 2N (before the first draw) to size 0 (after the last draw).
The final draw takes place from a deck of size 1, and must take the last remaining card.
The 2N-1 preceding draws have deck size 2N, ... 3, 2. For each of these you have a choice between the top two cards. 2N-1 decisions, each with 2 possibilities.
The brute force search space is therefore 2^(2N-1).
That is exponential growth, every optimization scientist's favorite sort of challenge.
If N is small, say 20, the brute force method needs to search "only" a trillion possibilities, which you can get done in a few thousand seconds on a readily available PC that does a few billion operations per second (each solution takes more than one CPU instruction to check).
If N is not quite as small, perhaps 100, the brute force method is akin to breaking the encryption on government secrets.
Not happy with the brute force approach then? I'm not either.
Before we get to the optimal solution, let’s take a break to explore what the Markov assumption is and what it means for us. It shows up in different fields using different verbiage, but I’ll just paraphrase it in a way that is particularly useful for this problem involving gameplay choices:
Markov Assumption
A process is Markov if and only if the choices available to you in the future depend only on what you have now, and not on how you got it.
A bad but often used real-world example is the stock market. Not only do taxation differences between short-term and long-term capital gains make history important in a small way, but investors do trend analysis and remember what stocks have done before, which affects future behavior in a big way.
A better example, especially for StackOverflow, is that of Turing machines and computer processors. What your program does next depends on the current instruction pointer and the contents of memory, but not the history of memory that’s since been overwritten. But there are many more. As we’ll see shortly, the Binary Snap problem can be formulated as Markov.
Now let’s talk about what makes the Markov assumption so important. For that, we’ll use the Travelling Salesman Problem. No, the Travelling International Salesman Problem. Still too messy. Let’s try the “Travelling International Salesman with a Single-Entry Visa Problem”. But we’ll go through all three of them briefly:
Travelling Salesman Problem
A salesman has to visit potential buyers in N cities. Plan an itinerary for the salesman which minimizes the total cost of visiting all N cities (variations: at least once / exactly once), given a matrix a_{j,k} which is the cost of travel from city j to city k.
Another variation is whether the starting city is predetermined or not.
Travelling International Salesman Problem
The cities the salesman needs to visit are split between two (or more) nations. A subset of the cities have border crossings and have travel options to all cities. The other cities can only reach cities which are either in the same country or are border-equipped.
Alternatively, instead of cities along the border, use cities with international airports. Won’t make a difference in the end.
The cost matrix for this problem looks rather like the flag of the Dominican Republic. Travel between interior cities of country A is permitted, as is travel between interior cities of country B (blue fields). Border cities connect with interior and border cities in both countries (white cross). And direct travel between an interior city of country A and one of country B is impossible (red areas).
Travelling International Salesman with a Single-Entry Visa
Now not only does the salesman need to visit cities in both countries, but he can only cross the border once.
(For travel fanatics, assume he starts in a third country and has single-entry visas for both countries, so he can’t visit some of A, all of B, then return to A for the rest).
Let’s look at an extremely simple case first: Only one border city. We’ll use one additional trick, the one from proof by induction: We assume that all problems smaller than the current one can be solved.
It should be fairly obvious that the Markov assumption holds when the salesman reaches the border city. No matter what path he took through country A, he has exactly the same choice of paths through country B.
But there’s a really important point here: Any path through country A ending at the border and any path through country B starting at the border, can be combined into a feasible full itinerary. If we have two full itineraries x and y, and x spent more money in country A than y did, then even if x has a lower total cost than the total cost of y, we can plan a path better than both, using the portion of y in country A and the portion of x in country B. I’m going to call that “splicing”. The Markov assumption lets us do it, by making all roads leading to the border interchangeable!
In fact, we can look just at the cities of country A, pick the best of all routes to the border, and forget about all the other options as soon as (in our plan) the salesman steps across into B.
This means instead of having factorial(N_A) * factorial(N_B) routes to look at, there’s only factorial(N_A) + factorial(N_B). Which is pretty much factorial(N_A) times better. Wow, is this Markov thing helpful or what?
Ok, that was too easy. Let’s mess it all up by having N_AB border cities instead of just one. Now if I have a path x which costs less in country B and a path y which costs less in country A, but they cross the border in different cities, I can’t just splice them together. So I have to keep track of all the paths through all the cities again, right?
Not exactly. What if, instead of throwing away all the paths through country A except the best y path, I actually keep one path ending in each border city (the lowest cost of all paths ending in the same border city)? Now, for any path x I look at in country B, I have a path y_endpt(x) that uses the same border city, to splice it with. So I have to solve the country A and country B partitions each N_AB times to find the best splice of a complete itinerary, for total work of N_AB * factorial(N_A) + N_AB * factorial(N_B), which is still way better than factorial(N_A) * factorial(N_B).
Enough development of tools. Let’s get back to the dragons, since they are subtle and quick to anger and I don’t want to be eaten or burnt to a crisp.
I claim that at any step T of the Binary Snap game, if we consider our “location” a pair of (card just drawn, card on top of deck), the Markov assumption will hold. These are the only things that determine our future options. All the cards below the top one in the deck must be in the same order no matter what we did before. And for knowing whether to count a Snap! with the next card, we need to know the last one taken. And that’s it!
Furthermore, there are N possible labels on the card last drawn, and N possible for the top card on the deck, for a total of N^2 “border cities”. As we start playing the game, there are two choices on the first turn, two on the second, two on the third, so we start out with 2^T possible game states (and a count of Snap!s for each). But by the pigeonhole principle, when 2^T > N^2, some of these plays must end in exactly the same game state (“border city”) as each other, and when that happens, we only need to keep the "dominating" one that got the best score on the way there.
Final complexity bound: 2*N timesteps, from no more than N^2 game states, with 2 draw choices at each, equals an upper limit of 4*N^3 simulated draws.
And that means the same trillion calculations that allowed us to do N=20 with the brute force method, now permit right around N=8000.
That makes the dragons happy, which makes us alive and well.
Implementation note: Since the challenge didn’t ask for the order of draws, but just the highest attainable number of snaps, all the data you need to keep track of in addition to the initial ordering of the cards is the time, T, and a 2-dimensional array (N rows, N columns) of the best score you can have and reach that state at time T.
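Here is one way that table could look in code, as a sketch rather than a reference solution. It assumes the deck is given top card first with labels 1..N; the function name maxDragonPairs and the extra row 0 (meaning "no card drawn yet") are my own additions:

```cpp
#include <algorithm>
#include <vector>

// deck[0] is the top card; labels are assumed to be 1..N.
// best[last][top] = highest number of Dragon Pairs achievable so far when
// `last` is the label just revealed (0 = nothing revealed yet) and `top`
// is the label currently on top of the deck. -1 marks unreachable states.
int maxDragonPairs(const std::vector<int>& deck) {
    const int cards = static_cast<int>(deck.size());
    if (cards < 2) return 0;
    const int n = *std::max_element(deck.begin(), deck.end());

    std::vector<std::vector<int>> best(n + 1, std::vector<int>(n + 1, -1));
    best[0][deck[0]] = 0;

    // deck[next] is the second-from-top card at this timestep.
    for (int next = 1; next < cards; ++next) {
        const int second = deck[next];
        std::vector<std::vector<int>> updated(n + 1, std::vector<int>(n + 1, -1));
        for (int last = 0; last <= n; ++last) {
            for (int top = 1; top <= n; ++top) {
                const int score = best[last][top];
                if (score < 0) continue;
                // Choice 1: reveal the top card; the second card becomes the top.
                updated[top][second] =
                    std::max(updated[top][second], score + (top == last ? 1 : 0));
                // Choice 2: reveal the second card; the top card stays on top.
                updated[second][top] =
                    std::max(updated[second][top], score + (second == last ? 1 : 0));
            }
        }
        best.swap(updated);
    }

    // Final draw: only the top card remains and must be taken.
    int answer = 0;
    for (int last = 0; last <= n; ++last)
        for (int top = 1; top <= n; ++top)
            if (best[last][top] >= 0)
                answer = std::max(answer, best[last][top] + (top == last ? 1 : 0));
    return answer;
}
```

Only two N-by-N tables are alive at any time, and the triple loop matches the roughly 4*N^3 bound stated above.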
Real world applications: If you take this approach and apply it to a digital radio (fixed uniform bit timing, discrete signal levels) receiving a signal using a convolutional error-correcting code, you have the Viterbi decoder. If you apply it to acquired medical data, with variable timing intervals and continuous signal levels, and add some other gnarly math, you get my doctoral project.

Shortest cost path

I have to find the shortest path from point D to R. These are fixed points.
This is an example of a situation:
The box also contains walls, which you cannot pass through unless you break them. Each wall break costs you, let's say, "a" points, where "a" is a positive integer.
Each move which doesn't involve a wall, costs you 1 point.
The mission is to find out of all the paths of minimum cost, the one with the least number of broken walls.
Since the width of the box can go up to 100 cells, backtracking is not practical; it's too slow. The only solution I came up with is this one:
1. Go east or south if there are no walls.
2. If south has a wall, check if west has a wall. If west has a wall, break the south wall. If west doesn't have a wall, go west until you find a cell whose south side has no wall. Repeat this process with south and east, in this order, until you exceed the cost of a broken wall. If the path from the west reaches the same place as breaking the south wall would, and costs the same as or less than "a" points, use this path; otherwise break the south wall.
3. If none of the above applies, break a south or east wall, depending on the box boundary.
Repeat steps 1, 2, 3 until the "passenger" arrives at point R. The three steps are related as "else-if" branches.
Can you come up with a better algorithm for this problem? I program in C++.
Use Dijkstra, but for costs give it 1 for a move that doesn't break a wall, and (a+0.00001) for breaking a wall. Then Dijkstra will give you what you want, the path that breaks the fewest walls among all minimal-cost paths.
Conceptually, imagine a traveler who can jump over walls -- while keeping track of the cost -- and can also split into two identical travelers when faced with a choice of two paths, so as to take them both (take that, Robert Frost!). Only one traveler moves at a time, the one who has incurred the lowest cost so far. That one moves, and writes on the floor "I reached here at a cost of only x". If I find such a note already there, if I got there more cheaply I erase the old note and write my own; if that other traveler got there more cheaply I commit suicide.
The two-part "cost first, then broken walls" criterion can be represented as a pair (c, w) that is compared lexicographically. c is the cost, w is the number of broken walls. That makes it a "single thing" again (in some sense), so it's a thing that you can put into algorithms that expect simply "a cost" (as an abstract thing that can be added to another cost or compared with another cost).
So we can just use A*, with a Manhattan Distance heuristic (perhaps there's something smarter that doesn't ignore walls completely, but this will work - underestimating the distance is admissible). The movement cost will, of course, not ignore walls. Neighbours will be all adjacent squares. All costs will be the pair I described above.
This could easily be modeled as a weighted graph, to which you then apply Dijkstra's shortest path algorithm. Each square is a node. It is connected to the nodes of the squares it is adjacent to. The weight of each connection is 1 if there is no wall between the two squares and "a" if there is. This will get you the minimal cost. Note that the path with the minimum cost and the path with the minimum number of wall breaks could be different.
Here is a general algorithm (you'll have to do the implementation yourself):
Convert the matrix into a weighted graph:
For each entry in the matrix, create a Vertex.
For each Vertex, create an array of Edges, one for each neighboring Vertex.
For each Edge, define a weight according to the cost of breaking the wall between the two Vertices that the Edge is connecting.
Then, run Dijkstra's algorithm (http://en.wikipedia.org/wiki/Dijkstra%27s_algorithm) on the graph, starting from Vertex D. As an output, you will have the shortest (cheapest) path from Vertex D to any other Vertex on the graph, including Vertex R.
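The suggestions above (Dijkstra on the grid, with the lexicographic (cost, broken walls) pair as the distance) can be combined into a short sketch. The question does not show how walls are stored, so the Grid layout below (one flag for the east side and one for the south side of each cell) is purely an assumption, as are the names Grid and cheapestPath. Crossing a wall is charged "a" and any other move 1, as in the answers:

```cpp
#include <functional>
#include <limits>
#include <queue>
#include <tuple>
#include <utility>
#include <vector>

// (total cost, walls broken); std::pair compares lexicographically.
using Cost = std::pair<long long, int>;

// Assumed wall layout: both tables are rows x cols.
struct Grid {
    int rows = 0, cols = 0;
    std::vector<std::vector<bool>> wallEast;   // wall between (r,c) and (r,c+1)
    std::vector<std::vector<bool>> wallSouth;  // wall between (r,c) and (r+1,c)
};

// Dijkstra from (sr,sc) to (tr,tc). Among all minimum-cost paths, the
// result has the fewest broken walls because of the pair ordering.
Cost cheapestPath(const Grid& g, int sr, int sc, int tr, int tc, int a) {
    const Cost unreachable{std::numeric_limits<long long>::max(),
                           std::numeric_limits<int>::max()};
    std::vector<std::vector<Cost>> dist(g.rows, std::vector<Cost>(g.cols, unreachable));
    using Item = std::tuple<Cost, int, int>;
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> queue;

    dist[sr][sc] = Cost{0, 0};
    queue.emplace(dist[sr][sc], sr, sc);
    const int dr[4] = {0, 0, 1, -1};
    const int dc[4] = {1, -1, 0, 0};

    while (!queue.empty()) {
        const Item item = queue.top();
        queue.pop();
        const Cost d = std::get<0>(item);
        const int r = std::get<1>(item), c = std::get<2>(item);
        if (d > dist[r][c]) continue;      // stale queue entry
        if (r == tr && c == tc) return d;  // first pop of the target is optimal
        for (int k = 0; k < 4; ++k) {
            const int nr = r + dr[k], nc = c + dc[k];
            if (nr < 0 || nc < 0 || nr >= g.rows || nc >= g.cols) continue;
            bool wall;
            if (dc[k] == 1) wall = g.wallEast[r][c];
            else if (dc[k] == -1) wall = g.wallEast[nr][nc];
            else if (dr[k] == 1) wall = g.wallSouth[r][c];
            else wall = g.wallSouth[nr][nc];
            const Cost nd{d.first + (wall ? a : 1), d.second + (wall ? 1 : 0)};
            if (nd < dist[nr][nc]) {
                dist[nr][nc] = nd;
                queue.emplace(nd, nr, nc);
            }
        }
    }
    return unreachable;
}
```

Using an integer pair avoids picking an epsilon such as 0.00001, which only works while the number of broken walls times epsilon stays below 1.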

'Stable' multi-dimensional scaling algorithm

I have a wireless mesh network of nodes, each of which is capable of reporting its 'distance' to its neighbors, measured in (simplified) signal strength to them. The nodes are geographically in 3d space but because of radio interference, the distance between nodes need not be trigonometrically (trigonomically?) consistent. I.e., given nodes A, B and C, the distance between A and B might be 10, between A and C also 10, yet between B and C 100.
What I want to do is visualize the logical network layout in terms of connectedness of nodes, i.e. include the logical distance between nodes in the visual.
So far my research has shown that multidimensional scaling (MDS) is designed for exactly this sort of thing. Given that my data can be directly expressed as a 2d distance matrix, it's even a simpler form of the more general MDS.
Now, there seem to be many MDS algorithms, see e.g. http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html and http://tapkee.lisitsyn.me/ . I need to do this in C++ and I'm hoping I can use a ready-made component, i.e. not have to re-implement an algo from a paper. So, I thought this: https://sites.google.com/site/simpmatrix/ would be the ticket. And it works, but:
The layout is not stable, i.e. every time the algorithm is re-run, the position of the nodes changes (see differences between image 1 and 2 below - this is from having been run twice, without any further changes). This is due to the initialization matrix (which contains the initial location of each node, which the algorithm then iteratively corrects) that is passed to this algorithm - I pass an empty one and then the implementation derives a random one. In general, the layout does approach the layout I expected from the given input data. Furthermore, between different runs, the direction of nodes (clockwise or counterclockwise) can change. See image 3 below.
The 'solution' I thought was obvious, was to pass a stable default initialization matrix. But when I put all nodes initially in the same place, they're not moved at all; when I put them on one axis (node 0 at 0,0 ; node 1 at 1,0 ; node 2 at 2,0 etc.), they are moved along that axis only. (see image 4 below). The relative distances between them are OK, though.
So it seems like this algorithm only changes distance between nodes, but doesn't change their location.
Thanks for reading this far - my questions are (I'd be happy to get just one or a few of them answered as each of them might give me a clue as to what direction to continue in):
Where can I find more information on the properties of each of the many MDS algorithms?
Is there an algorithm that derives the complete location of each node in a network, without having to pass an initial position for each node?
Is there a solid way to estimate the location of each point so that the algorithm can then correctly scale the distance between them? I have no geographic location of each of these nodes, that is the whole point of this exercise.
Are there any algorithms to keep the 'angle' at which the network is derived constant between runs?
If all else fails, my next option is going to be to use the algorithm I mentioned above, increase the number of iterations to keep the variability between runs at around a few pixels (I'd have to experiment with how many iterations that would take), then 'rotate' each node around node 0 to, for example, align nodes 0 and 1 on a horizontal line from left to right; that way, I would 'correct' the location of the points after their relative distances have been determined by the MDS algorithm. I would have to correct for the order of connected nodes (clockwise or counterclockwise) around each node as well. This might become hairy quite quickly.
Obviously I'd prefer a stable algorithmic solution - increasing iterations to smooth out the randomness is not very reliable.
Thanks.
EDIT: I was referred to cs.stackexchange.com and some comments have been made there; for algorithmic suggestions, please see https://cs.stackexchange.com/questions/18439/stable-multi-dimensional-scaling-algorithm .
Image 1 - with random initialization matrix:
Image 2 - after running with same input data, rotated when compared to 1:
Image 3 - same as previous 2, but nodes 1-3 are in another direction:
Image 4 - with the initial layout of the nodes on one line, their position on the y axis isn't changed:
Most scaling algorithms effectively set "springs" between nodes, where the resting length of the spring is the desired length of the edge. They then attempt to minimize the energy of the system of springs. When you initialize all the nodes on top of each other though, the amount of energy released when any one node is moved is the same in every direction. So the gradient of energy with respect to each node's position is zero, so the algorithm leaves the node where it is. Similarly if you start them all in a straight line, the gradient is always along that line, so the nodes are only ever moved along it.
(That's a flawed explanation in many respects, but it works for an intuition)
Try initializing the nodes to lie on the unit circle, on a grid or in any other fashion such that they aren't all co-linear. Assuming the library algorithm's update scheme is deterministic, that should give you reproducible visualizations and avoid degeneracy conditions.
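A small sketch of such a deterministic starting layout, placing the nodes evenly on a circle. Point2 and circleLayout are hypothetical names; how the coordinates are copied into the library's initialization matrix depends on that library and is not shown:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Point2 { double x; double y; };

// Places n nodes evenly spaced on a circle of the given radius, starting
// at angle 0 and going counterclockwise. The output is the same on every
// run and never collinear once n >= 3.
std::vector<Point2> circleLayout(std::size_t n, double radius = 1.0) {
    const double twoPi = 2.0 * std::acos(-1.0);
    std::vector<Point2> points;
    points.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        const double angle = twoPi * static_cast<double>(i) / static_cast<double>(n);
        points.push_back({radius * std::cos(angle), radius * std::sin(angle)});
    }
    return points;
}
```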
If the library is non-deterministic, either find another library which is deterministic, or open up the source code and replace the randomness generator with a PRNG initialized with a fixed seed. I'd recommend the former option though, as other, more advanced libraries should allow you to set edges you want to "ignore" too.
I have read the code of the "SimpleMatrix" MDS library and found that it uses a random permutation matrix to decide the order of points. After fixing the permutation order (just use srand(12345) instead of srand(time(0))), the result for the same data is unchanged between runs.
Obviously there's no exact solution in general to this problem; with just 4 nodes ABCD and distances AB=BC=AC=AD=BD=1, CD=10 you clearly cannot draw a suitable 2D diagram (and not even a 3D one).
What those algorithms do is just place springs between the nodes and then simulate a repulsion/attraction (depending on whether the spring is shorter or longer than the prescribed distance), probably also adding spatial friction to avoid resonance and explosion.
To keep a "stable" diagram just build a solution and then only update the distances, re-using the current position from previous solution as starting point. Picking two fixed nodes and aligning them seems a good idea to prevent a slow drift but I'd say that spring forces never end up creating a rotational momentum and thus I'd expect that just scaling and centering the solution should be enough anyway.
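The alignment idea from the question (pin node 0, put node 1 on a horizontal line, fix the clockwise/counterclockwise order) could be sketched as a post-processing step on the MDS output. This is an illustration under assumptions: Point2 and canonicalize are made-up names, and using node 2 to decide the mirror direction is an arbitrary choice:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Point2 { double x; double y; };

// Normalizes a layout in place: node 0 moves to the origin, the layout is
// rotated so node 1 lies on the positive x axis, and it is mirrored if
// needed so that node 2 ends up with y >= 0.
void canonicalize(std::vector<Point2>& pts) {
    if (pts.size() < 2) return;
    const Point2 origin = pts[0];
    for (Point2& p : pts) { p.x -= origin.x; p.y -= origin.y; }

    const double angle = std::atan2(pts[1].y, pts[1].x);
    const double c = std::cos(angle), s = std::sin(angle);
    for (Point2& p : pts) {
        const double x = p.x * c + p.y * s;   // rotate by -angle
        const double y = -p.x * s + p.y * c;
        p.x = x;
        p.y = y;
    }

    if (pts.size() > 2 && pts[2].y < 0.0)
        for (Point2& p : pts) p.y = -p.y;     // undo a mirrored layout
}
```

If nodes 0 and 1 coincide, or node 2 lies on the line through them, the orientation is not well defined and other reference nodes would be needed.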

Number of tours through m x n grid?

Let T(X,Y) be the number of tours over an X × Y grid such that:
the tour starts in the top left square
the tour consists of moves that are up, down, left, or right one square,
the tour visits each square exactly once, and
the tour ends in the bottom left square.
It’s easy to see, for example, that T(2,2) = 1, T(3,3) = 2, T(4,3) = 0, and T(3,4) = 4.
Write a program to calculate T(10,4).
I have been working on this for hours ... I need a program that takes the dimensions of the grid as input and returns the number of possible tours? Any idea on how I should go about solving this?
Since you're new to backtracking, this might give you an idea how you could solve this:
You need some data structure to represent the state of the cells on the grid (visited/not visited).
Your algorithm:
step(posx, posy, steps_left)
    if it is not a valid position, or already visited
        return
    if it's the last step and you are at the target cell
        you've found a solution, increment counter
        return
    mark cell as visited
    for each possible direction:
        step(posx_next, posy_next, steps_left-1)
    mark cell as not visited
and run with
step(0, 0, sizex*sizey)
The basic building blocks of backtracking are: evaluation of the current state, marking, the recursive step and the unmarking.
This will work fine for small boards. The real fun starts with larger boards where you have to cut branches on the tree which aren't solvable (eg: there's an unreachable area of not visited cells).
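The pseudocode above might translate into C++ roughly like this. It is a plain backtracking sketch with no pruning beyond stopping when the target is reached too early; countTours and the helper step are illustrative names, and the first argument is taken to be the grid width, which is the reading consistent with T(4,3) = 0 and T(3,4) = 4:

```cpp
#include <vector>

namespace {

// Counts the ways to finish a tour from (x, y) with stepsLeft cells still
// to visit (including the current one). The target is the bottom-left cell.
long long step(std::vector<std::vector<bool>>& visited, int x, int y, int stepsLeft) {
    const int height = static_cast<int>(visited.size());
    const int width = static_cast<int>(visited[0].size());
    if (x < 0 || y < 0 || x >= width || y >= height || visited[y][x]) return 0;

    const bool atTarget = (x == 0 && y == height - 1);
    if (stepsLeft == 1) return atTarget ? 1 : 0;  // last cell must be the target
    if (atTarget) return 0;                       // reached the target too early

    visited[y][x] = true;                         // mark
    const long long count = step(visited, x + 1, y, stepsLeft - 1) +
                            step(visited, x - 1, y, stepsLeft - 1) +
                            step(visited, x, y + 1, stepsLeft - 1) +
                            step(visited, x, y - 1, stepsLeft - 1);
    visited[y][x] = false;                        // unmark
    return count;
}

}  // namespace

// Number of tours from the top-left to the bottom-left cell that visit
// every cell of a width x height grid exactly once.
long long countTours(int width, int height) {
    if (width <= 0 || height <= 0) return 0;
    std::vector<std::vector<bool>> visited(height, std::vector<bool>(width, false));
    return step(visited, 0, 0, width * height);
}
```

The small cases from the question, such as countTours(3, 4), make a convenient check before trying countTours(10, 4).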
The assigned exercise is a good one. It forces you to think through several concepts, step-by-step. I cannot think all the concepts through for you, but maybe I can help by asking the following question.
At some point, your program must represent a partially completed tour. That is, it must represent a path which does not yet pass through all the squares and has not yet reached its target in the bottom left, but which might do both if the path were later extended. How do you mean to represent a partially completed tour?
If you can answer the question, and if you grasp the concept of recursion, then one suspects that you can solve the problem with some work but without too much real trouble. To represent the partially completed tour is your obstacle, so my recommendation is that you go to work on that.
Update: See the comment of @KarolyHorvath below. If you have not yet learned the use of dynamically allocated memory (or, equivalently, of STL containers like std::vector and std::list), then you should rather follow his hint, which is a good hint in any case.

Minimizing Sum of Distances: Optimization Problem

The actual question goes like this:
McDonald's is planning to open a number of joints (say n) along a straight highway. These joints require warehouses to store their food. A warehouse can store food for any number of joints, but has to be located at one of the joints only. McD has a limited number of warehouses (say k) available, and wants to place them in such a way that the average distance of joints from their nearest warehouse is minimized.
Given an array (n elements) of coordinates of the joints and an integer 'k', return an array of 'k' elements giving the coordinates of the optimal positioning of warehouses.
Sorry, I don't have any examples available since I'm writing this down from memory. Anyway, one sample could be:
array={1,3,4,5,7,7,8,10,11} (n=9)
k=1
Ans: {7}
This is what I've been thinking: For k=1, we can simply find out the median of the set, which would give the optimal location of the warehouse. However, for k>1, the given set should be divided into 'k' subsets (disjoint, and of contiguous elements of the superset), and median for each subset would give the warehouse locations. However, I don't understand on what basis the 'k' subsets should be formed. Thanks in advance.
EDIT: There's a variation to this problem also: Instead of sum/avg, minimize the maximum distance between a joint and its closest warehouse. I don't get this either..
The straight highway makes this an exercise in dynamic programming, working from left to right along the highway. A partial solution can be described by the location of the rightmost warehouse and the number of warehouses placed. The cost of the partial solution will be the total distance to the nearest warehouse (for fixed k minimising this is the same as minimising the average) or the maximum distance so far to the closest warehouse.
At each stage you have worked out the answers for the leftmost N joints and have them indexed by number of warehouses used and position of the rightmost warehouse - you need to save only the best cost. Now consider the next joint and work out the best solution for N+1 joints and all possible values of k and rightmost warehouse, using the answers you have stored for N joints to speed this up. Once you have worked out the best cost solution covering all the joints you know where its rightmost warehouse is, which gives you the location of one warehouse. Go back to the solution that has that warehouse as the rightmost joint and find out what solution that was based on. That gives you one more rightmost warehouse - and so you can work your way back to the location of all the warehouses for the best solution.
I tend to get the cost of working this out wrong, but with N joints and k warehouses to place you have N steps to take, each of them based on considering no more than Nk previous solutions, so I reckon the cost is kN^2.
This is NOT a clustering problem, it's a special case of a facility location problem. You can solve it using a general integer / linear programming package, but because the problem is on a line, there may be more efficient (and less expensive software-wise) algorithms that would work. You might consider dynamic programming since there are probably combinations of facilities that could be eliminated rather quickly. Look into the P-Median problem for more info.
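As an illustration, here is a sketch of a left-to-right dynamic program for the sum-of-distances version. It uses a slightly different state from the one described above: it follows the question's idea of splitting the sorted joints into k contiguous groups and serving each group from its median joint. The name placeWarehouses is hypothetical, and the running time is on the order of kN^2:

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Returns k warehouse coordinates (fewer if there are fewer joints) that
// minimize the total distance from each joint to its nearest warehouse.
std::vector<long long> placeWarehouses(std::vector<long long> joints, int k) {
    std::sort(joints.begin(), joints.end());
    const int n = static_cast<int>(joints.size());
    k = std::min(k, n);
    if (k <= 0) return {};

    // prefix[i] = sum of the first i sorted coordinates.
    std::vector<long long> prefix(n + 1, 0);
    for (int i = 0; i < n; ++i) prefix[i + 1] = prefix[i] + joints[i];

    // Cost of serving joints i..j (inclusive) from their median joint.
    auto segmentCost = [&](int i, int j) -> long long {
        const int m = (i + j) / 2;
        const long long left = joints[m] * (m - i + 1) - (prefix[m + 1] - prefix[i]);
        const long long right = (prefix[j + 1] - prefix[m + 1]) - joints[m] * (j - m);
        return left + right;
    };

    const long long inf = std::numeric_limits<long long>::max() / 4;
    // best[c][j] = minimum cost of serving the first j joints with c warehouses.
    std::vector<std::vector<long long>> best(k + 1, std::vector<long long>(n + 1, inf));
    std::vector<std::vector<int>> split(k + 1, std::vector<int>(n + 1, 0));
    best[0][0] = 0;
    for (int c = 1; c <= k; ++c)
        for (int j = c; j <= n; ++j)
            for (int i = c - 1; i < j; ++i) {
                if (best[c - 1][i] == inf) continue;
                const long long candidate = best[c - 1][i] + segmentCost(i, j - 1);
                if (candidate < best[c][j]) {
                    best[c][j] = candidate;
                    split[c][j] = i;  // the last group is joints i..j-1
                }
            }

    // Walk the recorded splits backwards to recover each group's median.
    std::vector<long long> warehouses(k);
    int j = n;
    for (int c = k; c >= 1; --c) {
        const int i = split[c][j];
        warehouses[c - 1] = joints[(i + j - 1) / 2];
        j = i;
    }
    return warehouses;
}
```

For the sample in the question with k=1 this returns {7}. The minimize-the-maximum variation needs a different cost function and is not covered by this sketch.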