How can I code this problem? (C++)

I am writing a simple game which stores datasets in a 2D grid (like a chess board). Each cell in the grid may contain a single integer (0 means the cell is empty). If the cell contains a number > 0, it is said to be "filled". The set of "filled" cells on the grid is known as a "configuration".
My problem is being able to "recognize" a particular configuration, regardless of where the configuration of cells sits in the MxN grid.
The problem (in my mind) breaks down into the following two sub-problems:
Somehow "normalising" the position of a configuration (e.g. "rebasing" its position to (0,0), such that the smallest rectangle containing all cells in the configuration has its top-left vertex at (0,0) in the MxN grid)
Computing some kind of similarity metric (or maybe simply a set difference?) to determine whether the current "normalised" configuration is one of the known configurations (i.e. is "recognized")
I suspect that I will need to use std::set (one of the few STL containers I haven't used so far, in all my years as a C++ coder!). I would appreciate any ideas/tips from anyone who has solved such a problem before. Any code snippets, pseudocode and/or links would be very useful indeed.
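For illustration, here is a minimal sketch of what I have in mind, using std::set of (row, col) pairs; the names (Configuration, normalise, matches) are made up and the code is untested:

    #include <algorithm>
    #include <set>
    #include <utility>

    using Cell = std::pair<int, int>;        // (row, col) of a filled cell
    using Configuration = std::set<Cell>;    // set of filled cells

    // Shift a configuration so its bounding box starts at (0, 0).
    Configuration normalise(const Configuration& cfg)
    {
        if (cfg.empty()) return cfg;
        int minRow = cfg.begin()->first, minCol = cfg.begin()->second;
        for (const Cell& c : cfg) {
            minRow = std::min(minRow, c.first);
            minCol = std::min(minCol, c.second);
        }
        Configuration out;
        for (const Cell& c : cfg)
            out.insert({c.first - minRow, c.second - minCol});
        return out;
    }

    // Two configurations match if their normalised forms are identical.
    bool matches(const Configuration& a, const Configuration& b)
    {
        return normalise(a) == normalise(b);
    }

If an exact match is too strict, std::set_symmetric_difference on the two normalised sets would give a simple count of mismatching cells instead.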

Similarity metrics are a massive area of academic research. You need to decide what level of sophistication is required for your task. It may be that you can simply drag a "template pattern" across your board, raster style, and for each location score +1 for a hit and -1 for a miss, then sum the score. Then find the location where the template got the highest score. This score_max is then your similarity metric for that template. If this method is inadequate, then you may need to go into more detail about the precise nature of the game.
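For illustration, here is a rough sketch of that raster-style scoring under one reading of the +1/-1 rule (filled template cell over filled board cell scores +1, over an empty cell scores -1); the board and template are assumed to be plain 2D vectors of ints, and the names are made up:

    #include <algorithm>
    #include <climits>
    #include <vector>

    using Grid = std::vector<std::vector<int>>;

    // Slide the template across the board and return the best hit/miss score.
    int bestTemplateScore(const Grid& board, const Grid& tmpl)
    {
        const int M = (int)board.size(), N = (int)board[0].size();
        const int m = (int)tmpl.size(),  n = (int)tmpl[0].size();
        int best = INT_MIN;
        for (int r = 0; r + m <= M; ++r) {
            for (int c = 0; c + n <= N; ++c) {
                int score = 0;
                for (int i = 0; i < m; ++i)
                    for (int j = 0; j < n; ++j)
                        if (tmpl[i][j] > 0)
                            score += (board[r + i][c + j] > 0) ? +1 : -1;
                best = std::max(best, score);   // score_max for this template
            }
        }
        return best;
    }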

Maybe you could use some hash function to identify configurations. Since you need to recognize patterns even if they are not at the same position on the board, this hash should not depend on the position of the cells but on the way they are organized.
If you store your 2D grid in a 1D array, you would need to find the first filled cell and start calculating the hash from there, up to the last filled cell.
Ex:
-----------------
| | | | |
-----------------
| | X | X | |
-----------------
| | | X | |
-----------------
| | | | |
-----------------
----------------+---------------+---------------+----------------
| | | | | | X | X | | | | X | | | | | |
----------------+---------------+---------------+----------------
|_______________________|
|
Compute hash based on this part of the array
However, there are cases where this won't work, e.g. when the pattern is shifted across lines:
-----------------
| | | | X |
-----------------
| X | | | |
----------------- Different configuration in 2D.
| X | | | |
-----------------
| | | | |
-----------------
----------------+---------------+---------------+----------------
| | | | X | X | | | | X | | | | | | | |
----------------+---------------+---------------+----------------
|_______________________|
Seems similar in 1D
So you'll need some way of dealing with these cases. I don't have any solution yet, but I'll try to find something if my schedule allows it!
After thinking a bit about it, maybe you could use two different representations for the grid: one where the lines are appended into a 1D array, and another one where the columns are appended into a 1D array. The hash would be calculated from both representations, and that would (I think) resolve the problem mentioned above.
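For illustration, a hedged sketch of that double hashing, assuming the configuration has already been cropped to its bounding box and using a simple FNV-1a-style combiner (all names made up):

    #include <cstdint>
    #include <vector>

    using Grid = std::vector<std::vector<int>>;   // bounding box of the configuration

    // Combine the row-major and column-major flattenings into one hash.
    std::uint64_t configurationHash(const Grid& box)
    {
        const std::size_t rows = box.size();
        const std::size_t cols = rows ? box[0].size() : 0;

        std::uint64_t h = 14695981039346656037ull;   // FNV-1a offset basis
        auto mix = [&h](std::uint64_t v) {           // FNV-1a style step
            h ^= v;
            h *= 1099511628211ull;
        };

        // First pass: lines (rows) appended into a 1D sequence.
        for (std::size_t r = 0; r < rows; ++r)
            for (std::size_t c = 0; c < cols; ++c)
                mix(box[r][c] > 0 ? 1 : 0);

        // Second pass: columns appended into a 1D sequence.
        for (std::size_t c = 0; c < cols; ++c)
            for (std::size_t r = 0; r < rows; ++r)
                mix(box[r][c] > 0 ? 1 : 0);

        // Mixing in the bounding-box dimensions also guards against
        // collisions between differently shaped boxes.
        mix(rows);
        mix(cols);
        return h;
    }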

This may be overkill for a small application, but OpenCV has some excellent image recognition and blob finding routines. If you treat the 2D board as an image, and the integer as brightness, it should be possible to use functions from that library.
And the link:
http://opencv.willowgarage.com/documentation/index.html
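For illustration, a rough sketch of how that could look with a modern OpenCV (3.x or later), assuming the board has already been copied into an 8-bit cv::Mat with cell values scaled to brightness; the function name is made up:

    #include <opencv2/opencv.hpp>

    // Find how well a known pattern matches anywhere on the board image.
    double bestMatchScore(const cv::Mat& boardImage, const cv::Mat& patternImage)
    {
        cv::Mat result;
        cv::matchTemplate(boardImage, patternImage, result, cv::TM_CCOEFF_NORMED);

        double minVal = 0.0, maxVal = 0.0;
        cv::Point minLoc, maxLoc;
        cv::minMaxLoc(result, &minVal, &maxVal, &minLoc, &maxLoc);

        // maxVal approaches 1.0 at the location where the pattern fits best,
        // and maxLoc tells you where on the board that is.
        return maxVal;
    }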

You can use a neural network for that job.
If you search for "neural network shape recognition" I think you will find something useful. There are tons of libraries that may help you, but if you have no experience with NNs this could be a little hard. Still, I think it is the easiest way.

Sounds like you want to feed your chessboard to a neural network trained to recognize the configuration.
This is very similar to the classic examples of image classification, with the only complication being that you don't know exactly where your configuration will appear in the grid, unless you always consider the whole grid - in that case a classic two-layer network should work.
HTM neural network implementations solve the offset problem out of the box. You can find plenty of ready-to-use implementations starting from here. Obviously you'll have to tweak the heck out of whichever implementation you find, but to my understanding you should be able to achieve exactly what you're asking for.
If you want to further investigate this route the Numenta forum will be a good starting point.

This reminds me of HashLife which uses QuadTrees. Check out the wikipedia pages on HashLife and Quadtrees.
There's also a link at the bottom of the Wikipedia page to a DrDobbs article about actually implementing such an algorithm: http://www.ddj.com/hpc-high-performance-computing/184406478
I hope those links help. They're interesting reads if nothing else.

As to the first part of the question, i.e. rebasing, try this:
Define a structure with two integers. Declare a pointer of that struct type. Read in (or compute) the number of live cells and allocate that much storage (using a routine like calloc). Store the coordinates of the live cells in the structures, then compute the minimum x coordinate and minimum y coordinate. Finally, in the grid, move each cell from [x][y] (the user-given or current coordinates) to [x-minx][y-miny]. This is somewhat expensive when reading from an already filled grid, but it works and helps with the subsequent part of the question.
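A small sketch of that rebasing step, working on a plain array of coordinate structs as described above (names are made up):

    struct CellCoord { int x; int y; };

    // Shift n live-cell coordinates so the smallest x and y become 0.
    void rebase(CellCoord* cells, int n)
    {
        if (n <= 0) return;
        int minX = cells[0].x, minY = cells[0].y;
        for (int i = 1; i < n; ++i) {
            if (cells[i].x < minX) minX = cells[i].x;
            if (cells[i].y < minY) minY = cells[i].y;
        }
        for (int i = 0; i < n; ++i) {
            cells[i].x -= minX;
            cells[i].y -= minY;
        }
    }

The caller would allocate the array (with calloc or new) based on the live-cell count, as described above.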

Related

Sorting query by distance requires reading entire data set?

To perform geoqueries in DynamoDB, there are libraries in AWS (https://aws.amazon.com/blogs/mobile/geo-library-for-amazon-dynamodb-part-1-table-structure/). But to sort the results of a geoquery by distance, the entire dataset must be read, correct? If a geoquery produces a large number of results, there is no way to paginate that (on the backend, not to the user) if you're sorting by distance, is there?
You are correct. To sort all of the data points by distance from some arbitrary location, you must read all the data from your DynamoDB table.
In DynamoDB, you can only sort results using a pre-computed value that has been stored in the DynamoDB table and is used as the sort key of the table or one of its indexes. If you only need to sort by distance from a single fixed location, you can do that with DynamoDB by pre-computing and storing the distance to that location; for an arbitrary location you cannot.
Possible Workaround (with limitations)
TL;DR: it's not such a bad problem if you can get away with only sorting the items that are within X km of an arbitrary point.
This still involves sorting the data points in memory, but it makes the problem easier by producing incomplete results (by limiting the maximum range of the results.)
To do this, you need the Geohash of your point P (from which you are measuring the distance of all other points). Suppose it is A234311. Then you need to pick what range of results is appropriate. Let's put some numbers on this to make it concrete. (I'm totally making these numbers up because the actual numbers are irrelevant for understanding the concepts.)
A - represents a 6400km by 6400km area
2 - represents a 3200km by 3200km area within A
3 - represents a 1600km by 1600km area within A2
4 - represents a 800km by 800km area within A23
3 - represents a 400km by 400km area within A234
1 - represents a 200km by 200km area within A2343
1 - represents a 100km by 100km area within A23431
Graphically, it might look like this:
View of A View of A23
|----------|-----------| |----------|-----------|
| | A21 | A22 | | | |
| A1 |-----|-----| | A231 | A232 |
| | A23 | A24 | | | |
|----------|-----------| |----------|-----------|
| | | | |A2341|A2342|
| A3 | A4 | | A233 |-----|-----|
| | | | |A2343|A2344|
|----------|-----------| |----------|-----------| ... and so on.
In this case, our point P is in A234311. Suppose also that we want to get the sorted points within 400km. A2343 is 400km by 400km, so we need to load the results from A2343 and all of its 8-connected neighbors (A2341, A2342, A2344, A2334, A2332, A4112, A4121, A4122). Once we've loaded only those into memory, we calculate the distances, sort them, and discard any results that are more than 400km away.
(You could keep the results that are more than 400km away as long as the users/clients know that beyond 400km, the data could be incomplete.)
The hashing method that DynamoDB Geo library uses is very similar to a Z-Order Curve—you may find it helpful to familiarize yourself with that method as well as Part 1 and Part 2 of the AWS Database Blog on Z-Order Indexing for Multifaceted Queries in DynamoDB.
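For illustration, here is a rough sketch of the load-then-sort-in-memory step in C++; loadCell() merely stands in for the actual DynamoDB query of one geohash cell, and Place/sortedWithinRange are made-up names:

    #include <algorithm>
    #include <cmath>
    #include <string>
    #include <vector>

    struct Place { double lat, lon; std::string id; };

    // Great-circle distance in km (haversine formula).
    double distanceKm(const Place& a, const Place& b)
    {
        const double R = 6371.0, rad = 3.14159265358979323846 / 180.0;
        double dLat = (b.lat - a.lat) * rad, dLon = (b.lon - a.lon) * rad;
        double h = std::sin(dLat / 2) * std::sin(dLat / 2) +
                   std::cos(a.lat * rad) * std::cos(b.lat * rad) *
                   std::sin(dLon / 2) * std::sin(dLon / 2);
        return 2.0 * R * std::asin(std::sqrt(h));
    }

    // Stand-in for the actual DynamoDB query of one geohash cell.
    std::vector<Place> loadCell(const std::string& geohashPrefix);

    // Load the centre cell plus its 8-connected neighbours, then sort and
    // trim the combined result in memory.
    std::vector<Place> sortedWithinRange(const Place& p,
                                         const std::vector<std::string>& cellHashes,
                                         double maxKm)
    {
        std::vector<Place> items;
        for (const auto& hash : cellHashes) {
            auto cell = loadCell(hash);
            items.insert(items.end(), cell.begin(), cell.end());
        }
        items.erase(std::remove_if(items.begin(), items.end(),
                        [&](const Place& q) { return distanceKm(p, q) > maxKm; }),
                    items.end());
        std::sort(items.begin(), items.end(),
                  [&](const Place& a, const Place& b) {
                      return distanceKm(p, a) < distanceKm(p, b);
                  });
        return items;
    }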
Not exactly. When querying by location you can query by a fixed partition key value and by the sort key, so you can limit your query result set and also apply a little filtering.
I have been racking my brain while designing a DynamoDB geohash proximity locator service. For this example, customer_A wants to find all service providers_X in their area. All customers and providers have a 'g8' key that stores their precise geohash location (to 8 levels).
The accepted way to accomplish this search is to generate a secondary index from the main table with a less accurate geohash 'g4', which gives a broader area for the main query key. I am applying key overloading and composite key structures for a single-table design. The goal in this design is to return all the data required in a single query; secondary indexes can duplicate data by design (storage is cheap, but CPU and bandwidth are not).
GSI1PK GSI1SK providerId Projected keys and attributes
---------------------------------------------
g4_9q5c provider pr_providerId1 name rating
g4_9q5c provider pr_providerId2 name rating
g4_9q5h provider pr_providerId3 name rating
Scenario 1: customer_A.g8_9q5cfmtk. So you issue a query where GSI1PK=g4_9q5c and a list of two providers is returned, not the three I desire.
But using geoHash.neighbor() will return the eight surrounding neighbors like 9q5h (see reference below). That's great because there is a provider in 9q5h, but this means I have to run nine queries, one on the center and eight on the neighbors, or run 1-N until I have the minimum results I require.
But which direction to query second, NW, SW, E? This would require another level of hinting toward which neighbor has more results, which you can't know in advance unless you run a pre-query for weighted results. But then you run the risk of only returning favorable neighbors, as there could be new providers in previously unfavored neighbors. You could apply some ML and randomized queries into neighbors to check current counts.
Before the above approach I tried this design.
GSI1PK GSI1SK providerId Projected keys and attributes
---------------------------------------------
loc g8_9q5cfmtk pr_provider1
loc g8_9q5cfjgq pr_provider2
loc g8_9q5fe954 pr_provider3
Scenario 2: customer_A.g8_9q5cfmtk. So you issue a query where GSI1PK=loc and GSI1SK is between g8_9q5ca and g8_9q5fz, and a list of three providers is returned, but a ton of data was pulled and discarded.
To achieve the above query, the "between X and Y" sort criteria is composed as follows: 9q5c.neighbors().sorted() = 9q59, 9q5c, 9q5d, 9q5e, 9q5f, 9q5g, 9qh1, 9qh4, 9qh5. So we can just use X=9q59 and Y=9qh5, but there are over 50 (I really didn't count after 50) matching quadrants in such a UTF between function.
Regarding the hash/size table above, I would recommend using this: https://www.movable-type.co.uk/scripts/geohash.html
Geohash length Cell width Cell height
1 ≤ 5,000km × 5,000km
2 ≤ 1,250km × 625km
3 ≤ 156km × 156km
4 ≤ 39.1km × 19.5km
5 ≤ 4.89km × 4.89km
...

Finding the overlapping locations for a 5 mile and 10 mile radius for a list of location data with latitude and longitude

I have a 10,000 observation dataset with a list of location information looking like this:
ADDRESS | CITY | STATE | ZIP |LATITUDE |LONGITUDE
1189 Beall Ave | Wooster | OH | 44691 | 40.8110501 |-81.93361870000001
580 West 113th Street | New York City | NY | 10025 | 40.8059768 | -73.96506139999997
268 West Putnam Avenue | Greenwich | CT | 06830 | 40.81776801 |-73.96324589997
1 University Drive | Orange | CA | 92866 | 40.843766801 |-73.9447589997
200 South Pointe Drive | Miami Beach | FL | 33139 | 40.1234801 |-73.966427997
I need to find the overlapping locations within a 5 mile and a 10 mile radius. I heard that there is a function called geodist which may allow me to do that, although I have never used it. The problem is that for geodist to work I may need all the combinations of the latitudes and longitudes to be side by side, which may make the file really large and hard to use. I also do not know how I would be able to get the lat/longs for every combination to be side by side.
Does anyone know of a way I could get the final output that I am looking for ?
Here is a broad outline of one possible approach to this problem:
Allocate each address into a latitude and longitude 'grid' by rounding the co-ordinates to the nearest 0.01 degrees or something like that.
Within each cell, number all the addresses 1 to n so that each has a unique id.
Write a datastep taking your address dataset as input via a set statement, and also load it into a hash object. Your dataset is fairly small, so you should have no problems fitting the relevant bits in memory.
For each address, calculate distances only to other addresses in the same cell, or other cells within a certain radius, i.e.
Decide which cell to look up
Iterate through all the addresses in that cell using the unique id you created earlier, looking up the co-ordinates of each from the hash object
Use geodist to calculate the distance for each one and output a record if it's a keeper.
This is a bit more work to program, but it is much more efficient than an O(n^2) brute force search. I once used a similar algorithm with a dataset of 1.8m UK postcodes and about 60m points of co-ordinate data.
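The outline above is for SAS, but the grid-bucketing idea itself is language-neutral. For illustration, here is a hedged C++ sketch of that approach; withinMiles() is only declared and stands in for a geodist-style distance calculation, and all names are made up:

    #include <cmath>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct Address { double lat, lon; /* plus the other columns */ };

    // Assumed helper: great-circle distance check, like geodist would do.
    bool withinMiles(const Address& a, const Address& b, double miles);

    // Bucket coordinates by rounding to the nearest 0.01 degrees.
    long long cellRow(double lat) { return std::llround(lat * 100.0); }
    long long cellCol(double lon) { return std::llround(lon * 100.0); }
    long long cellKey(long long row, long long col) { return row * 100000 + col; }

    // Compare each address only against addresses in the same or nearby cells.
    std::vector<std::pair<int, int>>
    pairsWithinRadius(const std::vector<Address>& addrs, double radiusMiles)
    {
        std::unordered_map<long long, std::vector<int>> cells;
        for (int i = 0; i < (int)addrs.size(); ++i)
            cells[cellKey(cellRow(addrs[i].lat), cellCol(addrs[i].lon))].push_back(i);

        // 10 miles is roughly 0.15 degrees of latitude, so scanning +/-15
        // hundredth-degree cells is a safe neighbourhood (longitude needs a
        // wider span away from the equator).
        const int span = 15;
        std::vector<std::pair<int, int>> out;
        for (int i = 0; i < (int)addrs.size(); ++i) {
            long long r0 = cellRow(addrs[i].lat), c0 = cellCol(addrs[i].lon);
            for (long long dr = -span; dr <= span; ++dr)
                for (long long dc = -span; dc <= span; ++dc) {
                    auto it = cells.find(cellKey(r0 + dr, c0 + dc));
                    if (it == cells.end()) continue;
                    for (int j : it->second)
                        if (j > i && withinMiles(addrs[i], addrs[j], radiusMiles))
                            out.push_back({i, j});
                }
        }
        return out;
    }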

Fast way to lookup entries with duplicates

I am looking for a way to help me quickly look up duplicate entries in some sort of table structure in a very efficient way. Here's my problem:
I have objects of type Cargo where each Cargo has a list of other associated Cargo. Each cargo also has a type associated with it.
so for example:
class Cargo {
    int cargoType;
    std::list<Cargo*> associated;   // requires <list>
};
Now, for each cargo object and its associated cargo, there is a certain value assigned based on their types. This evaluation is performed by classes that implement CargoEvaluator.
Now, I have a CargoEvaluatorManager which basically handles connecting everything together. CargoEvaluators are registered with CargoEvaluatorManager, then, to evaluate cargo, I call CargoEvaluatorManager.EvaluateCargo(Cargo* cargo).
Here's the current state
class CargoEvaluatorManager {
    std::vector<std::vector<CargoEvaluator*>> evaluators;

    double EvaluateCargo(Cargo* cargo)
    {
        double value = 0.0;
        for (auto& associated : cargo->associated) {
            auto evaluator = evaluators[cargo->cargoType][associated->cargoType];
            if (evaluator != nullptr)
                value += evaluator->Evaluate(cargo, associated);
        }
        return value;
    }
};
So to recap and mention a few extra points:
CargoEvaluatorManager stores CargoEvaluators in a 2-D array using cargo types as indices. The entire 2-D vector is initialized with nullptrs. When a CargoEvaluator is registered, resizing the array and the other bits and pieces I haven't shown here are handled appropriately.
I had tried using a map with std::pair as a key to look up different evaluators, but it is too slow.
This only allows one CargoEvaluator per combination of cargo types. I potentially want to have multiple cargo evaluators as well.
I am calling EvaluateCargo tens of billions of times. I am aware my current implementation is not the most efficient and am looking for alternatives.
What I am looking for
As stated above, I want to do much of what I've outlined, with the exception that I want to allow multiple Evaluators for each pair of Cargo types. What I envision, naively, is a table like this:
--------------------------------------------
|int type 1 | int type 2 | CargoEvaluator* |
|------------------------------------------|
| 5 | 7 | Eval1* |
| 5 | 7 | Eval2* |
| 4 | 6 | Eval3* |
--------------------------------------------
The lookup should be symmetric in that the set (5,7) and (7,5) should resolve to the same entries. For speed I don't mind preloading duplicate entries.
There are maybe 3x or more associated Cargo in the list than there are Evaluators, if that factors into things.
Performance is crucial, as mentioned above!
For bonus points, each cargo evaluator may have an additional penalty value associated with it, that is not dependent on how many Associates a Cargo has. In other words: if a row in the table above is looked up, I want to call double Evaluator->AddPenality() once and only once each time EvaluateCargo is called. I cannot store any instance variables since it would cause some multithreading issues.
One added constraint is I need to be able to identify the CargoEvaluators associated with a particular cargotype, meaning that hashing the two cargotypes together is not a viable option.
If any further clarification is needed, I'll gladly try to help.
Thank you all in advance!
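For illustration, one hedged sketch of such a table: keep the 2-D indexing by cargo type, but store a vector of evaluators per cell and normalise the pair order on registration and lookup, so that (5,7) and (7,5) resolve to the same entries (C++17, names made up, penalty handling not shown):

    #include <algorithm>
    #include <utility>
    #include <vector>

    class CargoEvaluator;   // as in the question

    class EvaluatorTable {
    public:
        void Register(int typeA, int typeB, CargoEvaluator* eval)
        {
            auto [lo, hi] = ordered(typeA, typeB);
            if ((int)table.size() <= hi)
                table.resize(hi + 1);
            if ((int)table[lo].size() <= hi)
                table[lo].resize(hi + 1);
            table[lo][hi].push_back(eval);       // multiple evaluators per pair
        }

        // Symmetric lookup: (5,7) and (7,5) return the same entries.
        const std::vector<CargoEvaluator*>& Find(int typeA, int typeB) const
        {
            static const std::vector<CargoEvaluator*> empty;
            auto [lo, hi] = ordered(typeA, typeB);
            if (lo >= (int)table.size() || hi >= (int)table[lo].size())
                return empty;
            return table[lo][hi];
        }

    private:
        static std::pair<int, int> ordered(int a, int b)
        {
            return {std::min(a, b), std::max(a, b)};
        }

        // table[min(a,b)][max(a,b)] -> all evaluators for that type pair.
        std::vector<std::vector<std::vector<CargoEvaluator*>>> table;
    };

If preloading duplicates is acceptable (as mentioned), Register could also store the entries under both [a][b] and [b][a] to skip the min/max at lookup time, and the once-per-call penalty could be handled by collecting the distinct evaluators touched in a small local container inside EvaluateCargo.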

Change data in table and copying to new table

I would like to make a macro in Excel, but I think it's too complicated to do it with recording... That's why I'm coming here for assistance.
The file:
I have a list of warehouse boxes all containing a specific ID, location (town), location (in or out) and a date.
Whenever boxes change location, this needs to be changed in this list and the date should be adjusted accordingly (this should be a manual input, since the changing of the list might not happen on the same day as the movement of the box).
On top of that, I need to count the number of times the location changes from in to out (so that I know how many times the box has been used).
The way of inputting:
A good way of inputting would be that you can make a list of the boxes whose information you want to change, e.g.:
ID | Location (town) | Location (in/out) | Date
------------------------------------------------
123-4 | Paris | OUT | 9-1-14
124-8 | London | IN | 9-1-14
999-84| London | IN | 10-1-14
124-8 | New York | OUT | 9-1-14
Then I'd make a button that uses the macro to change the data mentioned above in the master list (where all the boxes are listed) and in some way count the number of times OUT changes to IN, etc.
Is this possible?
I'm not entirely sure what you want updated in your Main List, but I don't think you need macros at all to achieve this. You can count the number of times a box location has changed by simply making a list of all your boxes in one column and the count in the next column. For the count, use the COUNTIFS formula to count all the rows where the box ID is the same and the location is in/out. Check VLOOKUP for updating your main list values.

What's faster? Searching for shortest path in a matrix or a list?

I have to store some cities and the distances between some of them and then search for the shortest path. The cities and the distances are read from a file.
I started by using a matrix but saw that it took too much space (more than double), so I changed to a list. Each list item stores three things: point1, point2 and the distance between them.
So for example I have this file:
Athens Stockholm 34
Stockholm Prague 23
which when I read is stored in the array as this:
_____0______ ______1______
point1 | Athens | Stockholm |
point2 | Stockholm | Prague |
distance | 34 | 23 |
------------ -------------
Then I got some doubts. This surely saves space, but is it going to take more time to traverse? The list is an array, but the connections (edges) are placed in an arbitrary order, and that's why I started thinking that it may take more time than if I used a matrix.
You might want to look into the adjacency-list representation of graphs, which is a modified version of your second idea that's best suited for shortest path problems. The idea is to have a table of nodes where for each node you store a list of outgoing edges from that node. This allows you to iterate over all the edges leaving a node in time proportional to the number of edges you have leaving a node, not the total number of edges in the graph (as you have in both the matrix and list versions suggested above). For this reason, most fast algorithms for doing graph searches use adjacency lists.
Hope this helps!
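For illustration, a minimal sketch of that adjacency-list representation, assuming each input line is "CityA CityB distance" as in the example above (names made up):

    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct Edge { int to; int distance; };

    int main()
    {
        std::unordered_map<std::string, int> cityIndex;   // name -> node id
        std::vector<std::string> cityName;                // node id -> name
        std::vector<std::vector<Edge>> adj;               // adjacency list

        auto nodeFor = [&](const std::string& name) {
            auto it = cityIndex.find(name);
            if (it != cityIndex.end()) return it->second;
            int id = (int)cityName.size();
            cityIndex[name] = id;
            cityName.push_back(name);
            adj.emplace_back();
            return id;
        };

        // Each input line: "CityA CityB distance", e.g. "Athens Stockholm 34".
        std::string a, b;
        int d;
        while (std::cin >> a >> b >> d) {
            int u = nodeFor(a), v = nodeFor(b);
            adj[u].push_back({v, d});     // undirected: store the edge both ways
            adj[v].push_back({u, d});
        }

        // adj[u] now lists only the edges leaving u, ready for BFS/Dijkstra.
        for (int u = 0; u < (int)adj.size(); ++u) {
            std::cout << cityName[u] << ":";
            for (const Edge& e : adj[u])
                std::cout << " " << cityName[e.to] << "(" << e.distance << ")";
            std::cout << "\n";
        }
        return 0;
    }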
Separate the names from the distances. Create a list that just contains city names.
_____0______ ______1______ ______2______
city | Athens | Stockholm | Prague |
------------ ------------- -------------
Then create the matrix separately
__0__ __1__ __2__
0 | 0 | 34 | 0 |
----- ----- -----
1 | 34 | 0 | 23 |
----- ----- -----
2 | 0 | 23 | 0 |
----- ----- -----
If you want to search, say, a route from Prague to Athens, then you start by finding where Prague and Athens are in the list...
Prague: 2
Athens: 0
Then you search through the matrix for your path.
(2,1,23) -> (1,0,34)
Finally, you translate to cities using your list
(Prague, Stockholm, 23) ->
(Stockholm, Athens, 34)
I think adjacency lists are surely the best option here. They're also very useful later when it comes to algorithms like DFS/BFS or Dijkstra.
If you don't know how to keep both towns and distances, just use some structure to keep them together. If you use numerically indexed towns, a simple pair structure will do (the easiest implementation would be n STL vectors of pairs).
Of course, if you don't want to use STL, you should try implementing your own lists of structures with pointers.
Your approach looks just fine.
To relax your mind, remember that parsing a single list/array will always be faster and more resource-friendly than working with two (or more) lists when you practically just need to look up a single line/entry of predefined data.
I tend to disagree with some of the other answers here, since I do not see any need to complicate things. Looking up several data cells, plus the additional need to combine those data cells to produce a resulting data set (as some answers proposed), takes more steps than doing a simple one-time run over a list to fetch a line of data. You would merely risk losing CPU cycles and memory on functions that look up and combine distributed data cells across several lists, while the data you currently use is already combined into a collection of ready results.
Simply put: doing a simple run-down of the list/array you currently have is hard to beat when it comes to speed.