How to find right data structure for a searching application?


My question can be approached from two perspectives: data structures and image processing. Let's begin with the data structure perspective. Suppose I have a component composed of several small items, as the following class shows:
class Component
{
public:
    struct Point
    {
        float x_;
        float y_;
    };
    Point center;
    Point bottom;
    Point top;
};
In the above example, the Component class is composed of member variables such as center, bottom and top (small items).
Now I have a stack of components (between 1000 and 10000 of them), and each component in the stack has been assigned different values, which means there are no duplicate components in the stack. So if one small item in the component is known, for example 'center' in the class above, we can find the unique component in the stack and then retrieve its other properties. My question is: how do I build the right container data structure to make the search easier? I am currently considering a vector and an STL find algorithm (pseudocode):
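std::vector<Component> comArray;
comArray.push_back(component1);
// ...
comArray.push_back(componentn);
// Linear search for the component whose center equals the known point.
auto it = std::find_if(comArray.begin(), comArray.end(),
    [&](const Component &c) { return c.center.x_ == center.x_ && c.center.y_ == center.y_; });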
I was wondering whether there are more efficient containers to solve this problem.
I can also explain my question from the image processing perspective. In image processing, connected component analysis is a very important step in object recognition. In my application I can obtain all the connected components in the image, and I have found that interesting objects fulfill the following requirement: their connected component centers lie in a specific range. Given this constraint, I can eliminate many connected components and then work on the candidates only. The key step in this procedure is how to search for candidate connected components when the center coordinate constraint is given. Any ideas will be appreciated.

If you need to be able to get them rather fast, here's a little strange solution for you.
Note that it is a bad solution generally speaking, but it may suit you.
You could make an ordinary vector< component >. Or that can even be a simple array. Then make three maps:
// Point needs an operator< (or the map a custom comparator) to be used as a key.
map< Point, Component * > center;
map< Point, Component * > bottom;
map< Point, Component * > top;
Fill them with all the available values of center, bottom and top as keys, and provide pointers to the corresponding Components as values (you could also use just indexes in a vector, so it would be map< Point, int >).
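After that, you just use center[Key], bottom[Key] or top[Key], and get either your value (if you store pointers), or the index of your value in the array (if you store indexes). Note that operator[] inserts a default entry when the key is missing, so use find() if a key may be absent.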
I wouldn't use such an approach often, but it could work if the values will not change (so you can fill the index maps once), and the data amount is rather big (so searching through an unsorted vector is slow), and you will need to search often.
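The index-map idea above could be sketched roughly as follows. This is a minimal sketch, not the answerer's code: `PointLess`, `PointIndex`, `build_center_index` and `find_by_center` are hypothetical names, `Point` is declared at namespace scope for brevity, and only the `center` map is shown (the `bottom` and `top` maps would look the same). It also assumes that the lookup key is bit-for-bit the same float pair that was stored, since exact float comparison is used.

```cpp
#include <cstddef>
#include <map>
#include <tuple>
#include <vector>

struct Point { float x_; float y_; };

// std::map needs a strict weak ordering on its key: compare x first, then y.
struct PointLess {
    bool operator()(const Point& a, const Point& b) const {
        return std::tie(a.x_, a.y_) < std::tie(b.x_, b.y_);
    }
};

struct Component { Point center; Point bottom; Point top; };

// Hypothetical alias: maps each center to the index of its component.
using PointIndex = std::map<Point, std::size_t, PointLess>;

// Fill the index once; it stays valid as long as the vector is not modified.
PointIndex build_center_index(const std::vector<Component>& comps) {
    PointIndex index;
    for (std::size_t i = 0; i < comps.size(); ++i)
        index[comps[i].center] = i;
    return index;
}

// Returns the matching component, or nullptr if no component has this center.
const Component* find_by_center(const std::vector<Component>& comps,
                                const PointIndex& index, const Point& center) {
    auto it = index.find(center);
    return it == index.end() ? nullptr : &comps[it->second];
}
```

Storing indexes rather than pointers keeps the maps valid even if the vector reallocates, as long as elements are only appended.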

You probably want spatial indexing data structures.

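I think you want to use a map or a hash map (std::unordered_map in C++11) to efficiently look up your component based on a "center" value. With std::map, Point needs an operator< so that it can serve as the key.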
std::map<Component::Point, Component> lookuptable;
lookuptable[component1.center] = component1;
....
auto iterator = lookuptable.find(someCenterValue);
if (iterator != lookuptable.end())
{
componentN = iterator->second;
}
As for finding elements in your set that are within a given coordinate range, there are several ways to do this. One easy way is to keep two sorted arrays of the component list, one sorted on the X axis and the other on the Y axis. To find the matching elements, do a binary search on either axis for the element closest to your target, then scan up and down the array until you go out of range. You could also use a kd-tree and find all the nearest neighbors.
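A simplified version of the sorted-array approach might look like this. It is a sketch under assumptions: it keeps a single array sorted on X (rather than two arrays), binary-searches the lower X bound, and filters on Y during the scan. `Center`, `sort_by_x` and `query_range` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Center { float x; float y; std::size_t id; };

// Sort once by x so that range queries can binary-search the x interval.
void sort_by_x(std::vector<Center>& centers) {
    std::sort(centers.begin(), centers.end(),
              [](const Center& a, const Center& b) { return a.x < b.x; });
}

// Returns the ids of all centers with x in [x_min, x_max] and y in [y_min, y_max].
// Precondition: centers is sorted by x (see sort_by_x).
std::vector<std::size_t> query_range(const std::vector<Center>& centers,
                                     float x_min, float x_max,
                                     float y_min, float y_max) {
    // First element whose x is not less than x_min.
    auto first = std::lower_bound(centers.begin(), centers.end(), x_min,
        [](const Center& c, float v) { return c.x < v; });
    std::vector<std::size_t> ids;
    // Scan forward until x leaves the range, keeping entries whose y also fits.
    for (auto it = first; it != centers.end() && it->x <= x_max; ++it)
        if (it->y >= y_min && it->y <= y_max)
            ids.push_back(it->id);
    return ids;
}
```

For a few thousand components this is likely fast enough; a kd-tree becomes more attractive when the X range alone selects a large fraction of the data.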

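If you want fast lookup and don't need to modify the elements, I think std::set is a good choice for your code. Note that lookup in a std::set is O(log n), not constant time.
set<Component> comArray;
comArray.insert(component1);
.....
comArray.insert(componentn);
// probe is a Component whose center holds the known value
set<Component>::iterator iter = comArray.find(probe);
Of course, you should write operator< for class Component (comparing only the center) and for the nested struct Point, because std::set orders and looks up its elements with operator<, not operator==.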

Related

Efficiently search among pairs of adjacent elements in a `set`

I'm currently working on a problem where I want to maintain the convex hull of a set of linear functions. It might look something like this:
I'm using a set<Line> to maintain the lines so that I can dynamically insert lines, which works fine. The lines are ordered by increasing slope, which is defined by the operator< of the lines. By throwing out "superseded" lines, the data structure guarantees that every line will have some segment that is a part of the convex hull.
Now the problem is that I want to search in this data structure for the crossing point whose X coordinate precedes a given x. Since those crossing points are only implicitly defined by adjacency in the set (in the image above, those are the points N, Q etc.), it seems to be entirely impossible to solve with the set alone, since I don't have
The option to find an element by anything but the primary compare function
The option to "binary search" in the underlying search tree myself, that is, compute the pre-order predecessor or successor of an iterator
The option to access elements by index efficiently
I am thus inclined to use a second set<pair<set<Line>::iterator, set<Line>::iterator> >, but this seems incredibly hacky. Seeing as we mainly need this for programming contests, I want to minimize code size, so I want to avoid a second set or a custom BBST data structure.
Is there a good way to model this scenario which still lets me maintain the lines dynamically and binary search by the value of a function on adjacent elements, with a reasonable amount of code?

Data structure for handling a list of 3 integers

I'm currently coding a physical simulation on a lattice. I'm interested in describing loops in this lattice: they are closed curves composed of the edges of the lattice cells. I'm storing the information on these lattice cells (by information I mean a Boolean variable saying whether or not the edge can be used to compose a loop) in a 3-dimensional Boolean array.
I'm now thinking about a good structure to handle these loops. They are basically a list of edges, so I would need something like an array of 3D integer vectors, each edge being defined by 3 coordinates in my current parameterization. I'm already thinking about building a class around this "list" object, as I'll need methods computing the loop diameter and probably more in the future.
But I'm not really aware of the choice of structures I have for that; my physics background hasn't taught me enough C++. So I'd like to hear your suggestions for shaping this piece of code. I would really enjoy discovering some new ways of coding this kind of thing.

You want two separate things. One is keeping track of all edges and allowing fast lookup of edge objects by an (int,int,int) index (you probably don't want int there but something like size_t or so). This is entirely independent from your second goal of creating ordered subsets of these.
General Collection (1)
Since your edge database is going to be sparse (i.e. only a few of the possible indices will actually identify as a particular edge), my prior suggestion of using a 3d matrix is unsuitable. Instead, you probably want to lookup edges with a hash map.
How easy this is depends on the expected size of the individual integers. If you can manage with no more than 21 bits per integer (for instance if your integers are short int values, which typically have only 16 bits), then you can concatenate them into one 64-bit value, which already has an std::hash implementation. Otherwise, you will have to implement your own hash specialisation for, e.g., std::hash<std::array<uint32_t,3>> (which is also quite easy, and highly stackable).
Once you can hash your key, you can throw it into an std::unordered_map and be done with it. That thing is fast.
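The 21-bit packing trick could be sketched like this. The names `Edge`, `pack_key`, `EdgeMap` and `find_edge` are hypothetical, and the sketch assumes every coordinate is non-negative and below 2^21 (checked with an assert).

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

struct Edge { bool usable; };  // hypothetical payload

// Packs three coordinates, each in [0, 2^21), into one 64-bit key (63 bits used).
std::uint64_t pack_key(std::uint32_t i, std::uint32_t j, std::uint32_t k) {
    const std::uint64_t limit = std::uint64_t(1) << 21;
    assert(i < limit && j < limit && k < limit);
    return (std::uint64_t(i) << 42) | (std::uint64_t(j) << 21) | std::uint64_t(k);
}

// std::hash<std::uint64_t> already exists, so no custom hash is needed.
using EdgeMap = std::unordered_map<std::uint64_t, Edge>;

// Returns a pointer to the stored edge, or nullptr if nothing is stored there.
const Edge* find_edge(const EdgeMap& edges,
                      std::uint32_t i, std::uint32_t j, std::uint32_t k) {
    auto it = edges.find(pack_key(i, j, k));
    return it == edges.end() ? nullptr : &it->second;
}
```

Because the packing is injective on the allowed range, two different index triples can never collide on the same key.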
Loop detection (2)
Then you want to have short-lived data structures for identifying loops, so you want a data structure that extends on one end but never on the other. That means you're probably fine with an std::vector or possibly with an std::deque if you have very large instances (but try the vector first!).
I'd suggest simply keeping the index to an edge in the local vector. You can always lookup the edge object in your unordered_map. Then the question is how to represent the index. If Int represents your integer type (e.g. int, size_t, short, ...) it's probably the most consistent to use an std::array<Int,3> --- if the types of the integers differ, you'll want an std::tuple<...>.
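A small class around such a vector of indices might look like the sketch below. `Loop`, `EdgeIndex` and `max_extent` are hypothetical names; in particular `max_extent` (the largest coordinate span over the three axes) is only a stand-in for whatever definition of loop diameter the simulation actually needs.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

using EdgeIndex = std::array<int, 3>;

class Loop {
public:
    // The loop only ever grows at one end, so a vector is sufficient.
    void add_edge(const EdgeIndex& e) { edges_.push_back(e); }
    std::size_t size() const { return edges_.size(); }

    // Largest coordinate span over the three axes (an assumed stand-in for a diameter).
    int max_extent() const {
        if (edges_.empty()) return 0;
        EdgeIndex lo = edges_.front(), hi = edges_.front();
        for (const EdgeIndex& e : edges_)
            for (std::size_t d = 0; d < 3; ++d) {
                lo[d] = std::min(lo[d], e[d]);
                hi[d] = std::max(hi[d], e[d]);
            }
        int extent = 0;
        for (std::size_t d = 0; d < 3; ++d)
            extent = std::max(extent, hi[d] - lo[d]);
        return extent;
    }

private:
    std::vector<EdgeIndex> edges_;
};
```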

surrounding objects algorithm

I'm working on a game where exactly one object may exist at location (x, y) where x and y are ints. For example, an object may exist at (0, 0) or it may not, but it is not possible for multiple objects to exist there at once.
I am trying to decide which STL container to use for the problem at hand and the best way to solve this problem.
Basically, I start with an object and its (x, y) location. The goal is to determine the tallest, largest possible rectangle based on that object's surrounding objects. The rectangle must be created by using all objects above and below the current object. That is, it must be the tallest that it can possibly be based on the starting object position.
For example, say the following represents my object grid and I am starting with the green object at location (3, 4):
Then, the rectangle I am looking for would be represented by the pink squares below:
So, assuming I start with the object at (3, 4) like the example shows, I will need to check if objects also exist at (2, 4), (4, 4), (3, 3), and (3, 5). If an object exists at any of those locations, I need to repeat the process for the object to find the largest possible rectangle.
These objects are rather rare and the game world is massive. It doesn't seem practical to just new a 2D array for the entire game world since most of the elements would be empty. However, I need to be able to index into any position to check if an object is there at any time.
Instead, I thought about using a std::map like so:
std::map< std::pair<int, int>, ObjectData> m_objects;
Then, as I am checking the surrounding objects, I could use map::find() in my loop, checking if the surrounding objects exist:
if(m_objects.find(std::make_pair(3, 4)) != m_objects.end())
{
//An object exists at (3, 4).
//Add it to the list of surrounding objects.
}
I could potentially be making a lot of calls to map::find() if I decide to do this, but the map would take up much less memory than newing a 2D array of the entire world.
Does anyone have any advice on a simple algorithm I could use to find what I am looking for? Should I continue using a std::map or is there a better container for a problem like this?

How much data do you need to store at each grid location? If you are simply looking for a flag that indicates neighbors you have at least two "low tech" solutions
a) If your grid is sparse, how about each square keeps a neighbor list? So each square knows which neighboring squares are occupied. You'll have some work to do to maintain the lists when a square is occupied or vacated. But neighbor lists mean you don't need a grid map at all
b) If the grid map locations are truly just points, use 1 bit per grid location. The resulting map will be 8 times smaller than one that uses a byte for each grid point. Bit operations are lightning fast. A 10,000x10,000 map will take 100,000,000 bits or 12.5MB (approx)
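Option (b) could be sketched as a bit-packed grid like the one below. `BitGrid` is a hypothetical class, and the sketch assumes non-negative coordinates within a fixed world size; out-of-range reads simply report an empty cell.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

class BitGrid {
public:
    // One bit per cell, packed into 64-bit words and initially all empty.
    BitGrid(std::size_t width, std::size_t height)
        : width_(width), height_(height), bits_((width * height + 63) / 64, 0) {}

    // Precondition: x < width and y < height.
    void set(std::size_t x, std::size_t y, bool occupied) {
        const std::size_t i = y * width_ + x;
        const std::uint64_t mask = std::uint64_t(1) << (i % 64);
        if (occupied) bits_[i / 64] |= mask;
        else          bits_[i / 64] &= ~mask;
    }

    // Out-of-range coordinates count as empty, which keeps neighbour checks simple.
    bool test(std::size_t x, std::size_t y) const {
        if (x >= width_ || y >= height_) return false;
        const std::size_t i = y * width_ + x;
        return (bits_[i / 64] >> (i % 64)) & 1u;
    }

    // Storage used by the bit array itself.
    std::size_t bytes() const { return bits_.size() * sizeof(std::uint64_t); }

private:
    std::size_t width_, height_;
    std::vector<std::uint64_t> bits_;
};
```

For a 10,000x10,000 world, `bytes()` comes to 12,500,000, matching the estimate above.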

An improvement would be to use a hashmap, if possible. This would allow you to at least do your potential extensive searches with an expected time complexity of O(1).
There's a thread here ( Mapping two integers to one, in a unique and deterministic way) that goes into some detail about how to hash two integers together.
If your compiler supports C++11, you could use std::unordered_map. If not, boost has basically the same thing: http://www.boost.org/doc/libs/1_38_0/doc/html/boost/unordered_map.html
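With C++11, the hash map version might look roughly like this. Since std::hash has no specialisation for std::pair, the sketch supplies a small functor; `PtHash`, `ObjectMap`, `count_neighbours` and the `ObjectData` payload are hypothetical names, and the hash shown is just one reasonable way to combine two ints.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>

using Pt = std::pair<int, int>;

// std::hash has no specialisation for std::pair, so supply a functor.
struct PtHash {
    std::size_t operator()(const Pt& p) const {
        // Put x in the high 32 bits and y in the low 32 bits, then hash the result.
        const std::uint64_t key =
            (std::uint64_t(std::uint32_t(p.first)) << 32) | std::uint32_t(p.second);
        return std::hash<std::uint64_t>()(key);
    }
};

struct ObjectData { int id; };  // hypothetical payload

using ObjectMap = std::unordered_map<Pt, ObjectData, PtHash>;

// Counts how many of the four edge-adjacent cells around (x, y) are occupied.
int count_neighbours(const ObjectMap& objects, int x, int y) {
    const Pt around[4] = { Pt(x - 1, y), Pt(x + 1, y), Pt(x, y - 1), Pt(x, y + 1) };
    int n = 0;
    for (const Pt& p : around)
        if (objects.count(p) != 0) ++n;
    return n;
}
```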

You may want to consider a spatial data structure. If the data is 'sparse', as you say, then doing a quadtree neighbourhood search might save you a lot of processing power. I would personally use an R-tree, but that's most likely because I have an R-tree library that I've written and can easily import.
For example, suppose you have a 1000x1000 grid with 10,000 elements. Assuming for the moment, a uniformly-random distribution, we would (based on the density) expect no more than, say . . . a chain of three to five objects touching in either dimension (at this density, a chain of three vertically-oriented objects will happen with probability 0.01% of the time). Suppose the object under consideration is located at (x,y). A window search, starting at (x-5,y-5) and going to (x+5,y+5) would give you a list of at most 121 elements to perform a linear search through. If your rect-picking algorithm notices that it would be possible to form a taller rectangle (i.e. if a rect under consideration touches the edges of this 11x11 bounding box), just repeat the window search for another 5x5 region in one direction of the original. Repeat as necessary.
This, of course, only works well when you have extremely sparse data. It might be worth adapting an R-tree such that the leaves are an assoc. data structure (i.e. Int -> Int -> Object), but at that point it's probably best to just find a solution that works on denser data.
I'm likely over-thinking this; there is likely a much simpler solution around somewhere.
Some references on R-trees:
The original paper, for the original algorithms.
The Wikipedia page, which has some decent overview on the topic.
The R-tree portal, for datasets and algorithms relating to R-trees.
I'll edit this with a link to my own R-tree implementation (public domain) if I ever get around to cleaning it up a little.

This sounds suspiciously like a homework problem (because it's got that weird condition "The rectangle must be created by using all objects above and below the current object" that makes the solution trivial). But I'll give it a shot anyway. I'm going to use the word "pixel" instead of "object", for convenience.
If your application really deserves heavyweight solutions, you might try storing the pixels in a quadtree (whose leaves contain plain old 2D arrays of just a few thousand pixels each). Or you might group contiguous pixels together into "shapes" (e.g. your example would consist of only one "shape", even though it contains 24 individual pixels). Given an initial unstructured list of pixel coordinates, it's easy to find these shapes; google "union-find". The specific benefit of storing contiguous shapes is that when you're looking for largest rectangles, you only need to consider those pixels that are in the same shape as the initial pixel.
A specific disadvantage of storing contiguous shapes is that if your pixel-objects are moving around (e.g. if they represent monsters in a roguelike game), I'm not sure that the union-find data structure supports incremental updates. You might have to run union-find on every "frame", which would be pretty bad.
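Grouping pixels into shapes with union-find could be sketched as follows. This is a batch version that rebuilds everything from a list of pixel coordinates. `UnionFind` and `label_shapes` are hypothetical names, and the sketch assumes the input contains no duplicate coordinates and that "contiguous" means sharing an edge.

```cpp
#include <cstddef>
#include <map>
#include <numeric>
#include <utility>
#include <vector>

using Pt = std::pair<int, int>;

class UnionFind {
public:
    // Every element starts as the root of its own set.
    explicit UnionFind(std::size_t n) : parent_(n) {
        std::iota(parent_.begin(), parent_.end(), std::size_t(0));
    }
    std::size_t find(std::size_t a) {
        while (parent_[a] != a) {
            parent_[a] = parent_[parent_[a]];  // path halving
            a = parent_[a];
        }
        return a;
    }
    void unite(std::size_t a, std::size_t b) {
        const std::size_t ra = find(a), rb = find(b);
        parent_[ra] = rb;
    }
private:
    std::vector<std::size_t> parent_;
};

// Returns one label per input pixel; two pixels share a label exactly when
// they are linked by a chain of edge-adjacent pixels.
std::vector<std::size_t> label_shapes(const std::vector<Pt>& pixels) {
    std::map<Pt, std::size_t> index;
    for (std::size_t i = 0; i < pixels.size(); ++i) index[pixels[i]] = i;
    UnionFind uf(pixels.size());
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        // Checking only right and down visits every adjacent pair once.
        const Pt right(pixels[i].first + 1, pixels[i].second);
        const Pt down(pixels[i].first, pixels[i].second + 1);
        auto it = index.find(right);
        if (it != index.end()) uf.unite(i, it->second);
        it = index.find(down);
        if (it != index.end()) uf.unite(i, it->second);
    }
    std::vector<std::size_t> labels(pixels.size());
    for (std::size_t i = 0; i < pixels.size(); ++i) labels[i] = uf.find(i);
    return labels;
}
```

Adding a pixel later only needs a few extra `unite` calls, but removing one can split a shape, which plain union-find does not handle without a rebuild.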
Anyway... let's just say you're using a std::unordered_map<std::pair<int,int>, ObjectData*>, because that sounds pretty reasonable to me. (std::pair has no standard hash, so the map needs a small hash functor. You should almost certainly store pointers in your map, not actual objects, because copying around all those objects is going to be a lot slower than copying pointers.)
#include <cassert>
#include <functional>
#include <unordered_map>
#include <utility>
typedef std::pair<int, int> Pt;
typedef std::pair<Pt, Pt> Rectangle;
/* std::hash has no specialisation for std::pair, so the map needs a hash functor. */
struct PtHash {
    std::size_t operator()(const Pt &p) const {
        return std::hash<unsigned long long>()(((unsigned long long)(unsigned)p.first << 32) | (unsigned)p.second);
    }
};
std::unordered_map<Pt, ObjectData *, PtHash> myObjects;
/* This helper function checks a whole vertical stripe of pixels. */
static bool all_pixels_exist(int x, int min_y, int max_y)
{
    assert(min_y <= max_y);
    for (int y = min_y; y <= max_y; ++y) {
        if (myObjects.find(Pt(x, y)) == myObjects.end())
            return false;
    }
    return true;
}
Rectangle find_tallest_rectangle(int x, int y)
{
    assert(myObjects.find(Pt(x,y)) != myObjects.end());
    int top = y;
    int bottom = y;
    while (myObjects.find(Pt(x, top-1)) != myObjects.end()) --top;
    while (myObjects.find(Pt(x, bottom+1)) != myObjects.end()) ++bottom;
    // We've now identified the first vertical stripe of pixels.
    // The next step is to "paint-roller" that stripe to the left as far as possible...
    int left = x;
    while (all_pixels_exist(left-1, top, bottom)) --left;
    // ...and to the right.
    int right = x;
    while (all_pixels_exist(right+1, top, bottom)) ++right;
    // Pt is (x, y), so the corners are (left, top) and (right, bottom).
    return Rectangle(Pt(left, top), Pt(right, bottom));
}

Bidirectional data structure for this situation

I'm studying a small part of my game engine and wondering how to optimize some parts.
The situation is quite simple and it is the following:
I have a map of Tiles (stored in a bi-dimensional array) (~260k tiles, but assume many more)
I have a list of Items, each of which is always in exactly one Tile
A Tile can logically contain infinite amount of Items
During game execution many Items are continuously created and they start from their own Tile
Every Item continuously changes its Tile to one of the neighbors (up, right, down, left)
Up to now every Item has a reference to its actual Tile, and I just keep a list of items.
Every time an Item moves to an adjacent tile I just update item->tile = .. and I'm fine. This works fine but it's unidirectional.
While extending the engine I realized that I have to find all items contained in a tile many times and this is effectively degrading the performance (especially for some situations, in which I have to find all items for a range of tiles, one by one).
This means I would like to find a data structure suitable to find all the items of a specific Tile better than in O(n), but I would like to avoid much overhead in the "moving from one tile to another" phase (now it's just assigning a pointer, I would like to avoid doing many operations there, since it's quite frequent).
I'm thinking about a custom data structure to exploit the fact that items always move to a neighboring cell, but I'm currently groping in the dark! Any advice would be appreciated, even tricky or cryptic approaches. Unfortunately I can't just waste memory, so a good trade-off is needed too.
I'm developing it in C++ with STL but without Boost. (Yes, I do know about multimap, it doesn't satisfy me, but I'll try if I don't find anything better)

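struct Coordinate { int x, y; };
// std::map needs an ordering on its key type.
bool operator<(const Coordinate &a, const Coordinate &b) { return a.x != b.x ? a.x < b.x : a.y < b.y; }
map<Coordinate, set<Item*>> tile_items;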
This maps coordinates on the tile map to sets of Item pointers indicating which items are on that tile. You wouldn't need an entry for every coordinate, only the ones that actually have items on them. Now, I know you said this:
"but I would like to avoid much overhead in the "moving from one tile to another" phase"
And this method would involve adding more overhead in that phase. But have you actually tried something like this yet and determined that it is a problem?

I would wrap a std::vector in a matrix type (i.e. impose 2D access on a 1D array). This gives you fast random access to any of your tiles (implementing the matrix is trivial).
Use
vector_index=y_pos*x_size+x_pos;
to index a vector of size
vector_size=y_size*x_size;
Then each tile can have a std::vector of items (if the number of items a tile has is very dynamic, maybe a deque). Again, these are random-access containers with very minimal overhead.
I would stay away from indirect containers for your use case.
PS: if you want you can have my matrix template.
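A minimal sketch of such a wrapper is shown below. It is not the answerer's matrix template: `TileGrid`, `at` and `move` are hypothetical names, and the sketch assumes row-major storage and that the order of items within a tile does not matter.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Item { int id; };  // hypothetical payload

class TileGrid {
public:
    TileGrid(std::size_t x_size, std::size_t y_size)
        : x_size_(x_size), tiles_(x_size * y_size) {}

    // Row-major access: row y_pos starts at offset y_pos * x_size.
    std::vector<Item*>& at(std::size_t x_pos, std::size_t y_pos) {
        return tiles_[y_pos * x_size_ + x_pos];
    }

    // Moves an item between tiles; order within a tile is not preserved.
    void move(Item* item, std::size_t fx, std::size_t fy,
              std::size_t tx, std::size_t ty) {
        std::vector<Item*>& from = at(fx, fy);
        auto it = std::find(from.begin(), from.end(), item);
        if (it != from.end()) {
            *it = from.back();  // swap-and-pop avoids shifting the tail
            from.pop_back();
        }
        at(tx, ty).push_back(item);
    }

private:
    std::size_t x_size_;
    std::vector<std::vector<Item*>> tiles_;
};
```

The linear search in `move` is cheap as long as tiles hold only a handful of items each; with crowded tiles, storing each item's slot index would avoid it.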

If you really think having each tile store its items will cost you too much space, consider using a quadtree to store the items instead. This allows you to efficiently get all the items on a tile, but leaves your Tile grid in place for item movement.

effective C++ data structure to consider in this case

Greetings code-gurus!
I am writing an algorithm that connects, for instance node_A of Region_A with node_D of Region_D. (node_A and node_D are just integers). There could be 100k+ such nodes.
Assume that the line segment between A and D passes through a number of other regions, B, C, Z . There will be a maximum of 20 regions in between these two nodes.
Each region has its own properties that may vary according to the connection A-D. I want to access these at a later point of time.
I am looking for a good data structure (perhaps an STL container) that can hold this information for a particular connection.
For example, for connection A - D I want to store :
node_A,
node_D,
crosssectional area (computed elsewhere) ,
regionB,
regionB_thickness,
regionB other properties,
regionC, ....
The data can be double , int , string and could also be an array /vector etc.
First I considered creating structs or classes for regionB, regionC etc .
But, for each connection A-D, certain properties like thickness of the region through which this connection passes are different.
There will only be 3 or 4 different things I need to store pertaining to a region.
Which data structure should I consider here (any STL container like vector?) Could you please suggest one? (would appreciate a code snippet)
To access a connection between nodes A-D, I want to make use of int node_A (an index).
This probably means I need to use a hashmap or similar data structure.
Can anyone please suggest a good data structure in C++ that can efficiently
hold this sort of data for connection A -D described above? (would appreciate a code snippet)
thank you!
UPDATE
For various reasons, I cannot make use of packages like Boost, so I want to know if I can use anything from the STL.

You should try to group stuff together when you can. You can group the information on each region together with something like the following:
class Region_Info {
public:
    Region *ptr;
    int thickness;
    // Put other properties here.
};
Then, you can more easily create a data structure for your line segment, maybe something like the following:
class Line_Segment {
public:
    int node_A;
    int node_D;
    int crosssectional_area;
    std::list<Region_Info> regions;  // regions crossed by this segment
};
If you are limited to only 20 regions, then a list should work fine. A vector is also fine if you would prefer.
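Since the question asks for lookup by node_A, the grouped data could be stored in a hash map using only the standard library, roughly as sketched here. `RegionInfo`, `Connection`, `ConnectionMap` and `total_thickness` are hypothetical names; the sketch assumes each node starts at most one connection, identifies regions by a string, and uses double for the measured quantities.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

struct RegionInfo {
    std::string region;   // hypothetical identifier, e.g. "B"
    double thickness;     // specific to this connection
};

struct Connection {
    int node_A;
    int node_D;
    double cross_sectional_area;
    std::vector<RegionInfo> regions;  // regions crossed, in order (at most about 20)
};

// Keyed by node_A; this assumes each node starts at most one connection.
using ConnectionMap = std::unordered_map<int, Connection>;

// Total thickness crossed by the connection starting at node_A, or 0 if there is none.
double total_thickness(const ConnectionMap& connections, int node_A) {
    auto it = connections.find(node_A);
    if (it == connections.end()) return 0.0;
    double sum = 0.0;
    for (const RegionInfo& r : it->second.regions) sum += r.thickness;
    return sum;
}
```

If one node can start several connections, a `std::unordered_multimap` or a map from node to a vector of connections would be the natural variation.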

Have you considered an adjacency array for each node, which stores the nodes it is connected to, along with other data?
First, define a node
class Node
{
    int id_;
    std::vector<AdjacencyInfo> adjacency_;
};
Here the class AdjacencyInfo can store the myriad data which you need. You can change the vector to a hash map with the node id as the key if lookup speed is an issue. For fancy access you may want to overload the [] operator if it is an essential requirement.
So as an example
class Graph
{
    std::map<int, Node> graph_;
};

boost has a graph library: boost.graph. Check it out if it is useful in your case.

Well, as everyone else has noticed, that's a graph. The question is, is it a sparse graph, or a dense one? There are generally two ways of representing graphs (more, but you'll probably only need to consider these two) :
adjacency matrix
adjacency list
An adjacency matrix is basically a NxN matrix which stores all the nodes in the first row and column, and connection data (edges) as cells, so you can index edges by vertices. Sorry if my English sucks, not my native language. Anyway, you should only consider adjacency matrix if you have a dense graph, and need to find node->edge->node connections really fast. However, iterating through neighbours or adding/removing vertices in an adjacency matrix is slow, the first requiring N iterations, and the second resizing the array/vector you use to store the matrix.
Your other option is to use an adjacency list. Basically, you have a class that represents a node, and one that represents an edge, that stores all the data for that edge, and two pointers that point to the nodes it's connected to. The node class has a collection of some sort (a list will do), and keeps track of all the edges it's connected to. Then you'll need a manager class, or simply a bunch of functions that operate on your nodes. Adding/connecting nodes is trivial in this case as is listing neighbours or connected edges. However, it's harder to iterate over all the edges. This structure is more flexible than the adjacency matrix and it's better for sparse graphs.
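An adjacency list along these lines could be sketched as follows. Note one deliberate change from the description above: the sketch stores indices into vectors instead of raw pointers, because pointers into a std::vector are invalidated when it grows. `Edge`, `Node`, `Graph` and the `thickness` field are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

struct Edge {
    std::size_t from, to;   // indices into Graph::nodes
    double thickness;       // hypothetical per-edge data
};

struct Node {
    int id;
    std::vector<std::size_t> edges;  // indices into Graph::edges
};

struct Graph {
    std::vector<Node> nodes;
    std::vector<Edge> edges;

    // Adds a node and returns its index.
    std::size_t add_node(int id) {
        nodes.push_back(Node{id, {}});
        return nodes.size() - 1;
    }

    // Stores the edge once and registers it with both endpoints.
    std::size_t connect(std::size_t a, std::size_t b, double thickness) {
        edges.push_back(Edge{a, b, thickness});
        const std::size_t e = edges.size() - 1;
        nodes[a].edges.push_back(e);
        nodes[b].edges.push_back(e);
        return e;
    }

    // Lists the node indices adjacent to n.
    std::vector<std::size_t> neighbours(std::size_t n) const {
        std::vector<std::size_t> out;
        for (std::size_t e : nodes[n].edges)
            out.push_back(edges[e].from == n ? edges[e].to : edges[e].from);
        return out;
    }
};
```

Keeping all edges in one vector also makes iterating over every edge straightforward, which softens the drawback mentioned above.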
I'm not sure that I understood your question completely, but if I did, I think you'd be better off with an adjacency matrix, seems like you have a dense graph with lots of interconnected nodes and only need connection info.
Wikipedia has a good article on graphs as a data structure, as well as good references and links, and finding examples shouldn't be hard. Hope this helps.
