Code Instructions
Hey guys. Above is a coding project I have been assigned. I'm reading the instructions and am completely lost because I've never learned how to code an undirected graph. I'm not sure how my professor expects us to do this, but I was hoping I could get some help from experts. Are there any readings (or tips) you would suggest so I can familiarize myself with how to get started on the program? Appreciate it, thanks!
The problem to solve is called "Word Morph". Your instructor gave a restriction to use an undirected graph, where a neighbouring node differs by only one character from the origin. Unfortunately the requirements are not clear enough: "differs by one character" is ambiguous. If we use the replace-insert-delete idiom, we need a full edit distance; otherwise we could use simpler functions, such as comparing 2 equal-size strings. I assume the full approach.
And, at the end, you need to find the shortest way through the graph.
I can present you one possible solution: a complete working code example.
By the way, the graph is non-weighted, because the cost of travelling from one node to the next is always 1. So we are actually talking about an undirected, non-weighted graph.
The main algorithms we need here are:
Levenshtein distance, to calculate the distance between 2 strings,
and Breadth-First Search, to find the shortest path through a graph.
Please note: if the words always have the same length, then no Levenshtein is needed. Just compare character by character and count the differences. That's rather simple. (But as said: the instructions are a little bit unclear.)
Both algorithms can be modified. For example, you do not need a Levenshtein distance greater than 1, so you can terminate the distance calculation as soon as the distance exceeds 1. And in the breadth-first search, you could record the path through which you are going.
OK, now, how to implement an undirected graph. There are 2 possibilities:
A matrix (which I will not explain here)
A list or a vector
I would recommend the vector approach for this case. An adjacency matrix would be rather sparse, so better a vector.
The basic data structure that you need is a node containing a vertex and its neighbours. So you have the word (a std::string) as the vertex and the "neighbours": a std::vector of index positions of other nodes in the graph.
The graph is a vector of nodes, and a node's neighbours point to other nodes in this vector. We use the index into the vector to denote a neighbour. All this we pack into a structure and call it "UndirectedGraph". We add a "build" function that checks for adjacencies. In this function we compare each string with every other and check if the difference is < 2, so 0 or 1. 0 means equal, and 1 is the given constraint. If we find such a difference, we add it as a neighbour in the corresponding node.
Additionally we add a breadth-first search algorithm. It is described on Wikipedia.
To ease the implementation of that algorithm we add a "visited" flag to the node.
Please see the code below:
#include <sstream>
#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <iterator>
#include <iomanip>
#include <numeric>
#include <algorithm>
#include <queue>
std::istringstream textFileStream(R"#(peach
peace
place
plane
plans
plays
slays
stays
stars
sears
years
yearn
)#");
using Vertex = std::string;
using Edge = std::vector<size_t>;
// One node in a graph
struct Node
{
// The Vertex is a std::string
Vertex word{};
// The edges are the index of the neighbour nodes
Edge neighbour{};
// For Breadth-First Search
bool visited{ false };
// Easy input and output
friend std::istream& operator >> (std::istream& is, Node& n) {
n.neighbour.clear();
std::getline(is, n.word);
return is;
}
friend std::ostream& operator << (std::ostream& os, const Node& n) {
os << std::left << std::setw(25) << n.word << " --> ";
std::copy(n.neighbour.begin(), n.neighbour.end(), std::ostream_iterator<size_t>(os, " "));
return os;
}
};
// The graph
struct UndirectedGraph
{
// Contains a vector of nodes
std::vector<Node> graph;
// Build adjacencies
void build();
// Find Path
bool checkForPathFromStartToEnd(size_t start, size_t end);
bool checkForPath() {
    bool result = false;
    if (graph.size() > 1) {
        size_t s = graph.size() - 2;
        size_t e = s + 1;
        result = checkForPathFromStartToEnd(s, e);
    }
    return result;
}
// Easy input and output
friend std::istream& operator >> (std::istream& is, UndirectedGraph& ug) {
ug.graph.clear();
std::copy(std::istream_iterator<Node>(is), std::istream_iterator<Node>(), std::back_inserter(ug.graph));
return is;
}
friend std::ostream& operator << (std::ostream& os, const UndirectedGraph& ug) {
size_t i{ 0 };
for (const Node& n : ug.graph)
os << std::right << std::setw(4) << i++ << ' ' << n << '\n';
return os;
}
};
// Distance between 2 strings
size_t levensthein(const std::string& string1, const std::string& string2)
{
const size_t lengthString1(string1.size());
const size_t lengthString2(string2.size());
if (lengthString1 == 0) return lengthString2;
if (lengthString2 == 0) return lengthString1;
std::vector<size_t> costs(lengthString2 + 1);
std::iota(costs.begin(), costs.end(), 0);
for (size_t indexString1 = 0; indexString1 < lengthString1; ++indexString1) {
costs[0] = indexString1 + 1;
size_t corner = indexString1;
for (size_t indexString2 = 0; indexString2 < lengthString2; ++indexString2) {
size_t upper = costs[indexString2 + 1];
if (string1[indexString1] == string2[indexString2]) {
costs[indexString2 + 1] = corner;
}
else {
const size_t temp = std::min(upper, corner);
costs[indexString2 + 1] = std::min(costs[indexString2], temp) + 1;
}
corner = upper;
}
}
size_t result = costs[lengthString2];
return result;
}
// Build the adjacencies
void UndirectedGraph::build()
{
// Iterate over all words in the graph
for (size_t i = 0; i < graph.size(); ++i)
// Compare everything with everything (because of symmetry, omit half of the comparisons)
for (size_t j = i + 1; j < graph.size(); ++j)
// Check the distance of the 2 words to compare
if (levensthein(graph[i].word, graph[j].word) < 2U) {
// And store the adjacencies
graph[i].neighbour.push_back(j);
graph[j].neighbour.push_back(i);
}
}
bool UndirectedGraph::checkForPathFromStartToEnd(size_t start, size_t end)
{
// Assume that it will not work
bool result = false;
// Store intermediate tries in queue
std::queue<size_t> check{};
// Set initial values
graph[start].visited = true;
check.push(start);
// As long as we have not visited all possible nodes
while (!check.empty()) {
// Get the next node to check
size_t currentNode = check.front(); check.pop();
// If we found the solution . . .
if (currentNode == end) {
// Then set the resulting value and stop searching
result = true;
break;
}
// Go through all neighbours of the current node
for (const size_t next : graph[currentNode].neighbour) {
// If the neighbour node has not yet been visited
if (!graph[next].visited) {
// Then visit it
graph[next].visited = true;
// And check following elements next time
check.push(next);
}
}
}
return result;
}
int main()
{
// Get the filename from the user
std::cout << "Enter Filename for file with words:\n";
std::string filename{};
//std::cin >> filename;
// Open the file
//std::ifstream textFileStream(filename);
// If the file could be opened . . .
if (textFileStream) {
// Create an empty graph
UndirectedGraph undirectedGraph{};
// Read the complete file into the graph
textFileStream >> undirectedGraph;
Node startWord{}, targetWord{};
std::cout << "Enter start word and enter target word\n"; // teach --> learn
std::cin >> startWord >> targetWord;
// Add the 2 words at the end of our graph
undirectedGraph.graph.push_back(startWord);
undirectedGraph.graph.push_back(targetWord);
// Build adjacency graph, including the just added words
undirectedGraph.build();
// For debug purposes: Show the graph
std::cout << undirectedGraph;
std::cout << "\n\nMorph possible? --> " << std::boolalpha << undirectedGraph.checkForPath() << '\n';
}
else {
// File could not be found or opened
std::cerr << "Error: Could not open file : " << filename;
}
return 0;
}
Please note: although I have implemented asking for a file name, I do not use it in this example. I read from an istringstream instead. You need to delete the istringstream later and comment the existing statements back in.
Regarding the requirements from the instructor: I did not use any STL/Library/Boost searching algorithm. (What for? In this example?) But of course I use other C++ STL containers. I will not reinvent the wheel and come up with a new "vector" or queue. And I will definitely not use "new", C-style arrays or pointer arithmetic.
Have fun!
And to all others: Sorry, I could not resist writing the code . . .
In SQL there is the feature to say something like
SELECT TOP 20 distance FROM dbFile ORDER BY distance ASC
If my SQL is correct, with, say, 10,000 records this should return the 20 smallest distances in my database.
I don't have a database. I have a 100,000-element simple array.
Is there a C++ container, Boost, MFC or STL that provides simple code for a struct like
struct closest{
int ID;
double distance;
closest():ID(-1), distance(std::numeric_limits<double>::max( )){}
};
Where I can build a sorted by distance container like
boost::container::XXXX<closest> top(20);
And then have a simple
top.replace_if(closest(ID,Distance));
Where the container will replace the entry with the current highest distance in my container with my new entry if it is less than the current highest distance in my container.
I am not worried about speed. I like elegant clean solutions where containers and code do all the heavy lifting.
EDIT. Addendum after all the great answers received.
What I really would have liked to have found, due to its elegance, is a sorted container that I could create with a container size limit (in my case 20). Then I could push or insert to my heart's content, 100,000 items or more. But (there is always a but) the container would have maintained the max size of 20 by replacing, or not inserting, an item if its comparator value was not within the lowest 20 values.
Yes, I know now from all these answers that the same effect can be achieved by programming and tweaking existing containers. Perhaps when the next round of suggestions for the C & C++ standards committee sits, we could suggest self-sorting (which we kind of have already) and self-size-limiting containers.
What you need is a max-heap of size 20. Recall that the root of your heap will be the largest value in the heap.
This heap will contain the records with smallest distance that you have encountered so far. For the first 20 out of 10000 values you just push to the heap.
At this point you iterate through the rest of the records and for each record, you compare it to the root of your heap.
Remember that the root of your heap is basically the very worst of the very best.(The record with the largest distance, among the 20 records with the shortest distance you have encountered so far).
If the value you are considering is not worth keeping (its distance is larger than the root of your tree), ignore that record and just keep moving.
Otherwise you pop your heap (get rid of the root) and push the new value in. The priority queue will automatically put its record with the largest distance on the root again.
Once you keep doing this over the entire set of 10000 values, you will be left with the 20 records that have the smallest distance, which is what you want.
Each push/pop takes O(log k) time for a heap of fixed size k = 20, which is effectively constant; iterating through all N inputs is O(N), so overall this is an essentially linear, O(N log k), solution.
Edit:
I thought it would be useful to show my idea in C++ code. This is a toy example, you can write a generic version with templates but I chose to keep it simple and minimalistic:
#include <iostream>
#include <queue>
using namespace std;
class smallestElements
{
private:
priority_queue<int,std::vector<int>,std::less<int> > pq;
size_t maxSize;
public:
smallestElements(size_t size): maxSize(size)
{
pq=priority_queue<int, std::vector<int>, std::less<int> >();
}
void possiblyAdd(int newValue)
{
if(pq.size()<maxSize)
{
pq.push(newValue);
return;
}
if(newValue < pq.top())
{
pq.pop(); //get rid of the root
pq.push(newValue); //priority queue will automatically restructure
}
}
void printAllValues()
{
priority_queue<int,std::vector<int>,std::less<int> > cp=pq;
while(cp.size()!=0)
{
cout<<cp.top()<<" ";
cp.pop();
}
cout<<endl;
}
};
How you use this is really straightforward. Basically, in your main function somewhere you will have:
smallestElements se(20); //we want 20 smallest
//...get your stream of values from wherever you want, call the int x
se.possiblyAdd(x); //no need for bounds checking or anything fancy
//...keep looping or potentially adding until the end
se.printAllValues();//shows all the values in your container of smallest values
// alternatively you can write a function to return all values if you want
If this is about filtering the 20 smallest elements from a stream on the fly, then a solution based on std::priority_queue (or std::multiset) is the way to go.
However, if it is about finding the 20 smallest elements in a given array, I wouldn't go for a special container at all, but simply use the algorithm std::nth_element - a partial sorting algorithm that will give you the n smallest elements - EDIT: or std::partial_sort (thanks Jarod42) if the elements also have to be sorted. It has linear average complexity and it's just a single line to write (+ the comparison operator, which you need in any case):
#include <vector>
#include <iostream>
#include <algorithm>
struct Entry {
int ID;
double distance;
};
std::vector<Entry> data;
int main() {
//fill data;
std::nth_element(data.begin(), data.begin() + 19, data.end(),
[](auto& l, auto& r) {return l.distance < r.distance; });
std::cout << "20 elements with smallest distance: \n";
for (size_t i = 0; i < 20; ++i) {
std::cout << data[i].ID << ":" << data[i].distance << "\n";
}
std::cout.flush();
}
If you don't want to change the order of your original array, you would have to make a copy of the whole array first though.
My first idea would be using a std::map or std::set with a custom comparator for this (edit: or even better, a std::priority_queue as mentioned in the comments).
Your comparator does your sorting.
You essentially add all your elements to it. After an element has been added, check whether there are more than n elements inside. If there are, remove the last one.
I am not 100% sure that there is no more elegant solution, but even std::set is pretty neat here.
All you have to do is to define a proper comparator for your elements (e.g. a > operator) and then do the following:
std::set<closest> tops(arr, arr+20);
tops.insert(another);
tops.erase(tops.begin());
I would use nth_element like #juanchopanza suggested before he deleted it.
His code looked like:
bool comp(const closest& lhs, const closest& rhs)
{
return lhs.distance < rhs.distance;
}
then
std::vector<closest> v = ....;
nth_element(v.begin(), v.begin() + 20, v.end(), comp);
Though if it was only ever going to be twenty elements then I would use a std::array.
Just so you can all see what I am currently doing which seems to work.
struct closest{
int base_ID;
int ID;
double distance;
closest(int BaseID, int Point_ID,
double Point_distance):base_ID(BaseID),
ID(Point_ID),distance(Point_distance){}
closest():base_ID(-1), ID(-1),
distance(std::numeric_limits<double>::max( )){}
bool operator<(const closest& rhs) const
{
return distance < rhs.distance;
}
};
void calc_nearest(void)
{
boost::heap::priority_queue<closest> svec;
for (int current_gift = 0; current_gift < g_nVerticesPopulated; ++current_gift)
{
    double best_distance = std::numeric_limits<double>::max();
double our_distance=0.0;
svec.clear();
for (int all_other_gifts = 0; all_other_gifts < g_nVerticesPopulated;++all_other_gifts)
{
our_distance = distanceVincenty(g_pVertices[current_gift].lat,g_pVertices[current_gift].lon,g_pVertices[all_other_gifts].lat,g_pVertices[all_other_gifts].lon);
if (our_distance != 0.0)
{
if (our_distance < best_distance) // don't bother to push and sort if the calculated distance is greater than current 20th value
svec.push(closest(g_pVertices[current_gift].ID,g_pVertices[current_gift].ID,our_distance));
if (all_other_gifts%100 == 0)
{
while (svec.size() > no_of_closest_points_to_calculate) svec.pop(); // throw away any points above no_of_closest_points_to_calculate
closest t = svec.top(); // the furthest of the no_of_closest_points_to_calculate points for optimisation
best_distance = t.distance;
}
}
}
std::cout << current_gift << "\n";
}
}
As you can see, I have 100,000 lat & long points drawn on an OpenGL sphere.
I am calculating each point against every other point and currently retaining only the closest 20 points. There is some primitive optimisation going on by not pushing a value if it is bigger than the 20th-closest point.
As I am used to Prolog taking hours to solve something, I am not worried about speed. I shall run this overnight.
Thanks to all for your help.
It is much appreciated.
Still have to audit the code and results but happy that I am moving in the right direction.
I have posted a number of approaches to the similar problem of retrieving the top 5 minimum values recently here:
https://stackoverflow.com/a/33687969/1025391
There are implementations that keep a specific number of smallest or greatest items from an input vector in different ways. The nth_element algorithm performs a partial sort, the priority queue maintains a heap, the set a binary search tree, and the deque- and vector-based approaches just remove an element based on a (linear) min/max search.
It should be fairly easy to implement a custom comparison operator and to adapt the number of items to keep n.
Here's the code (refactored based off the other post):
#include <algorithm>
#include <deque>
#include <functional>
#include <queue>
#include <set>
#include <vector>
#include <random>
#include <iostream>
#include <chrono>
template <typename T, typename Compare = std::less<T>>
std::vector<T> filter_nth_element(std::vector<T> v, typename std::vector<T>::size_type n) {
auto target = v.begin()+n;
std::nth_element(v.begin(), target, v.end(), Compare());
std::vector<T> result(v.begin(), target);
return result;
}
template <typename T, typename Compare = std::less<T>>
std::vector<T> filter_pqueue(std::vector<T> v, typename std::vector<T>::size_type n) {
std::vector<T> result;
std::priority_queue<T, std::vector<T>, Compare> q;
for (auto i: v) {
q.push(i);
if (q.size() > n) {
q.pop();
}
}
while (!q.empty()) {
result.push_back(q.top());
q.pop();
}
return result;
}
template <typename T, typename Compare = std::less<T>>
std::vector<T> filter_set(std::vector<T> v, typename std::vector<T>::size_type n) {
std::set<T, Compare> s;
for (auto i: v) {
s.insert(i);
if (s.size() > n) {
s.erase(std::prev(s.end()));
}
}
return std::vector<T>(s.begin(), s.end());
}
template <typename T, typename Compare = std::less<T>>
std::vector<T> filter_deque(std::vector<T> v, typename std::vector<T>::size_type n) {
std::deque<T> q;
for (auto i: v) {
q.push_back(i);
if (q.size() > n) {
q.erase(std::max_element(q.begin(), q.end(), Compare()));
}
}
return std::vector<T>(q.begin(), q.end());
}
template <typename T, typename Compare = std::less<T>>
std::vector<T> filter_vector(std::vector<T> v, typename std::vector<T>::size_type n) {
std::vector<T> q;
for (auto i: v) {
q.push_back(i);
if (q.size() > n) {
q.erase(std::max_element(q.begin(), q.end(), Compare()));
}
}
return q;
}
template <typename Clock = std::chrono::high_resolution_clock>
struct stopclock {
std::chrono::time_point<Clock> start;
stopclock() : start(Clock::now()) {}
~stopclock() {
auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(Clock::now() - start);
std::cout << "elapsed: " << elapsed.count() << "ms\n";
}
};
std::vector<int> random_data(std::vector<int>::size_type n) {
std::mt19937 gen{std::random_device()()};
std::uniform_int_distribution<> dist;
std::vector<int> out(n);
for (auto &i: out)
i = dist(gen);
return out;
}
int main() {
std::vector<int> data = random_data(1000000);
stopclock<> sc;
std::vector<int> result = filter_nth_element(data, 5);
std::cout << "smallest values: ";
for (auto i : result) {
std::cout << i << " ";
}
std::cout << "\n";
std::cout << "largest values: ";
result = filter_nth_element<int, std::greater<int>>(data, 5);
for (auto i : result) {
std::cout << i << " ";
}
std::cout << "\n";
}
Example output is:
$ g++ test.cc -std=c++11 && ./a.out
smallest values: 4433 2793 2444 4542 5557
largest values: 2147474453 2147475243 2147477379 2147469788 2147468894
elapsed: 123ms
Note that in this case only the position of the nth element is accurate with respect to the order imposed by the provided comparison operator. The other elements are guaranteed to be smaller/greater or equal to that one however, depending on the comparison operator provided. That is, the top n min/max elements are returned, but they are not correctly sorted.
Don't expect the other algorithms to produce results in a specific order either. (While the approaches using priority queue and set actually produce sorted output, their results have the opposite order).
For reference:
http://en.cppreference.com/w/cpp/algorithm/nth_element
http://en.cppreference.com/w/cpp/container/priority_queue
http://en.cppreference.com/w/cpp/container/set
http://en.cppreference.com/w/cpp/algorithm/max_element
I actually have 100,000 lat & lon points drawn on an OpenGL sphere. I want to work out the 20 nearest points to each of the 100,000 points. So we have two loops: pick each point, then calculate that point against every other point and save the closest 20 points.
This reads as if you want to perform a k-nearest neighbor search in the first place. For this, you usually use specialized data structures (e.g., a binary search tree) to speed up the queries (especially when you are doing 100k of them).
For spherical coordinates you'd have to do a conversion to a cartesian space to fix the coordinate wrap-around. Then you'd use an Octree or kD-Tree.
Here's an approach using the Fast Library for Approximate Nearest Neighbors (FLANN):
#include <vector>
#include <random>
#include <iostream>
#include <flann/flann.hpp>
#include <cmath>
struct Point3d {
float x, y, z;
void setLatLon(float lat_deg, float lon_deg) {
static const float r = 6371.; // sphere radius
float lat(lat_deg*M_PI/180.), lon(lon_deg*M_PI/180.);
x = r * std::cos(lat) * std::cos(lon);
y = r * std::cos(lat) * std::sin(lon);
z = r * std::sin(lat);
}
};
std::vector<Point3d> random_data(std::vector<Point3d>::size_type n) {
static std::mt19937 gen{std::random_device()()};
std::uniform_int_distribution<> dist(0, 36000);
std::vector<Point3d> out(n);
for (auto &i: out)
i.setLatLon(dist(gen)/100., dist(gen)/100.);
return out;
}
int main() {
// generate random spherical point cloud
std::vector<Point3d> data = random_data(1000);
// generate query point(s) on sphere
std::vector<Point3d> query = random_data(1);
// convert into library datastructures
auto mat_data = flann::Matrix<float>(&data[0].x, data.size(), 3);
auto mat_query = flann::Matrix<float>(&query[0].x, query.size(), 3);
// build KD-Tree-based index data structure
flann::Index<flann::L2<float> > index(mat_data, flann::KDTreeIndexParams(4));
index.buildIndex();
// perform query: approximate nearest neighbor search
int k = 5; // number of neighbors to find
std::vector<std::vector<int>> k_indices;
std::vector<std::vector<float>> k_dists;
index.knnSearch(mat_query, k_indices, k_dists, k, flann::SearchParams(128));
// k_indices now contains for each query point the indices to the neighbors in the original point cloud
// k_dists now contains for each query point the distances to those neighbors
// removed printing of results for brevity
}
For reference:
https://en.wikipedia.org/wiki/Nearest_neighbor_search
https://en.wikipedia.org/wiki/Octree
https://en.wikipedia.org/wiki/Kd-tree
http://www.cs.ubc.ca/research/flann/
A heap is the data structure that you need. Pre-C++11, the STL only had functions which manage heap data in your own arrays (std::make_heap, std::push_heap, std::pop_heap). Someone mentioned that Boost has a heap class, but you don't need to go so far as to use Boost if your data is simple integers; the STL heap functions will do just fine. And, of course, the algorithm is to order the heap so that the highest value is the first one, i.e. a max-heap. So with each new value, you push it onto the heap, and once the heap reaches 21 elements in size, you pop the first value from the heap. This way, whatever 20 values remain are always the 20 lowest.
I've got the following problem: I have a game which runs at an average of 60 frames per second. Each frame I need to store values in a container, and there must be no duplicates.
It probably has to store fewer than 100 items per frame, but the number of insert calls will be a lot more (with many rejected because entries have to be unique). Only at the end of the frame do I need to traverse the container. So about 60 traversals of the container per second, but a lot more insertions.
Keep in mind the items to store are simple integers.
There are a bunch of containers I can use for this but I cannot make up my mind what to pick. Performance is the key issue for this.
Some pros/cons that I've gathered:
vector
(PRO): Contiguous memory, a huge factor.
(PRO): Memory can be reserved up front, very few allocations/deallocations afterwards
(CON): No alternative but to traverse the container (std::find) on each insert() to reject duplicates? The comparison is simple though (integers), and the whole container can probably fit in the cache
set
(PRO): Simple, clearly meant for this
(CON): Not constant insert-time
(CON): A lot of allocations/deallocations per frame
(CON): Not contiguous memory. Traversing a set of hundreds of objects means jumping around a lot in memory.
unordered_set
(PRO): Simple, clearly meant for this
(PRO): Average case constant time insert
(CON): Seeing as I store integers, the hash operation is probably a lot more expensive than anything else
(CON): A lot of allocations/deallocations per frame
(CON): Not contiguous memory. Traversing a set of hundreds of objects means jumping around a lot in memory.
I'm leaning towards the vector route because of memory access patterns, even though set is clearly meant for this issue. The big issue that is unclear to me is whether traversing the vector on each insert is more costly than the allocations/deallocations and memory lookups of set (especially considering how often this must be done).
I know ultimately it all comes down to profiling each case, but as a head start, or just theoretically, what would probably be best in this scenario? Are there any pros/cons I might have missed as well?
EDIT: As I didn't mention, the container is clear()ed at the end of each frame.
I did timing with a few different methods that I thought were likely candidates. Using std::unordered_set was the winner.
Here are my results:
Using UnorderedSet: 0.078s
Using UnsortedVector: 0.193s
Using OrderedSet: 0.278s
Using SortedVector: 0.282s
Timing is based on the median of five runs for each case.
compiler: gcc version 4.9.1
flags: -std=c++11 -O2
OS: ubuntu 4.9.1
CPU: Intel(R) Core(TM) i5-4690K CPU @ 3.50GHz
Code:
#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <numeric>
#include <random>
#include <set>
#include <unordered_set>
#include <vector>
using std::cerr;
static const size_t n_distinct = 100;
template <typename Engine>
static std::vector<int> randomInts(Engine &engine,size_t n)
{
auto distribution = std::uniform_int_distribution<int>(0,n_distinct);
auto generator = [&]{return distribution(engine);};
auto vec = std::vector<int>();
std::generate_n(std::back_inserter(vec),n,generator);
return vec;
}
struct UnsortedVectorSmallSet {
std::vector<int> values;
static const char *name() { return "UnsortedVector"; }
UnsortedVectorSmallSet() { values.reserve(n_distinct); }
void insert(int new_value)
{
auto iter = std::find(values.begin(),values.end(),new_value);
if (iter!=values.end()) return;
values.push_back(new_value);
}
};
struct SortedVectorSmallSet {
std::vector<int> values;
static const char *name() { return "SortedVector"; }
SortedVectorSmallSet() { values.reserve(n_distinct); }
void insert(int new_value)
{
auto iter = std::lower_bound(values.begin(),values.end(),new_value);
if (iter==values.end()) {
values.push_back(new_value);
return;
}
if (*iter==new_value) return;
values.insert(iter,new_value);
}
};
struct OrderedSetSmallSet {
std::set<int> values;
static const char *name() { return "OrderedSet"; }
void insert(int new_value) { values.insert(new_value); }
};
struct UnorderedSetSmallSet {
std::unordered_set<int> values;
static const char *name() { return "UnorderedSet"; }
void insert(int new_value) { values.insert(new_value); }
};
int main()
{
//using SmallSet = UnsortedVectorSmallSet;
//using SmallSet = SortedVectorSmallSet;
//using SmallSet = OrderedSetSmallSet;
using SmallSet = UnorderedSetSmallSet;
auto engine = std::default_random_engine();
std::vector<int> values_to_insert = randomInts(engine,10000000);
SmallSet small_set;
namespace chrono = std::chrono;
using chrono::system_clock;
auto start_time = system_clock::now();
for (auto value : values_to_insert) {
small_set.insert(value);
}
auto end_time = system_clock::now();
auto& result = small_set.values;
auto sum = std::accumulate(result.begin(),result.end(),0u);
auto elapsed_seconds = chrono::duration<float>(end_time-start_time).count();
cerr << "Using " << SmallSet::name() << ":\n";
cerr << " sum=" << sum << "\n";
cerr << " elapsed: " << elapsed_seconds << "s\n";
}
I'm going to put my neck on the block here and suggest that the vector route is probably the most efficient when the size is 100 and the objects being stored are integral values. The simple reason for this is that set and unordered_set allocate memory for each insert, whereas the vector needn't allocate more than once.
You can increase search performance dramatically by keeping the vector ordered, since then all searches can be binary searches and therefore complete in O(log n) time.
The downside is that the inserts will take a tiny fraction longer due to the memory moves, but it sounds as if there will be many more searches than inserts, and moving (on average) 50 contiguous memory words is an almost instantaneous operation.
Final word:
Write the correct logic now. Worry about performance when the users are complaining.
EDIT:
Because I couldn't help myself, here's a reasonably complete implementation:
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <iterator>
#include <utility>
#include <vector>
template<typename T>
struct vector_set
{
using vec_type = std::vector<T>;
using const_iterator = typename vec_type::const_iterator;
using iterator = typename vec_type::iterator;
vector_set(size_t max_size)
: _max_size { max_size }
{
_v.reserve(_max_size);
}
/// @return pair of iterator, bool.
/// If the value has been inserted, the bool will be true and
/// the iterator will point to the value; the iterator is end()
/// if it wasn't inserted due to space exhaustion.
auto insert(const T& elem)
-> std::pair<iterator, bool>
{
if (_v.size() < _max_size) {
auto it = std::lower_bound(_v.begin(), _v.end(), elem);
if (_v.end() == it || *it != elem) {
return make_pair(_v.insert(it, elem), true);
}
return make_pair(it, false);
}
else {
return make_pair(_v.end(), false);
}
}
auto find(const T& elem) const
-> const_iterator
{
auto vend = _v.end();
auto it = std::lower_bound(_v.begin(), vend, elem);
if (it != vend && *it != elem)
it = vend;
return it;
}
bool contains(const T& elem) const {
return find(elem) != _v.end();
}
const_iterator begin() const {
return _v.begin();
}
const_iterator end() const {
return _v.end();
}
private:
vec_type _v;
size_t _max_size;
};
using namespace std;

BOOST_AUTO_TEST_CASE(play_unique_vector)
{
    vector_set<int> v(100);
    for (size_t i = 0 ; i < 1000000 ; ++i) {
        v.insert(int(random() % 200));
    }
    cout << "unique integers:" << endl;
    copy(begin(v), end(v), ostream_iterator<int>(cout, ","));
    cout << endl;
    cout << "contains 100: " << v.contains(100) << endl;
    cout << "contains 101: " << v.contains(101) << endl;
    cout << "contains 102: " << v.contains(102) << endl;
    cout << "contains 103: " << v.contains(103) << endl;
}
As you said you have many insertions and only one traversal, I'd suggest using a vector and pushing the elements in regardless of whether they are already in the vector. Each insert is done in O(1).
When you need to go through the vector, sort it first and remove the duplicate elements. I believe this can be done in O(n) as they are bounded integers.
EDIT: Sorting in linear time is possible with counting sort, as presented in this video. If that's not feasible, then you are back to O(n lg(n)).
You will have very little cache miss because of the contiguity of the vector in memory, and very few allocations (especially if you reserve enough memory in the vector).
I wonder if anybody could help me out.
I look for a data structure (such as list, queue, stack, array, vector, binary tree etc.) supporting these four operations:
isEmpty (true/false)
insert single element
pop (i.e. get&remove) single element
split into two structures, e.g. take approximately half (say, ±20%) of the elements and move them to another structure
Note that I don't care about order of elements at all.
Insert/pop example:
A.insert(1), A.insert(2), A.insert(3), A.insert(4), A.insert(5) // contains 1,2,3,4,5 in any order
A.pop() // 3
A.pop() // 2
A.pop() // 5
A.pop() // 1
A.pop() // 4
and the split example:
A.insert(1), A.insert(2), A.insert(3), A.insert(4), A.insert(5)
A.split(B)
// A = {1,4,3}, B={2,5} in any order
I need the structure to be as fast as possible - preferably all four operations in O(1). I doubt it has already been implemented in std, so I will implement it myself (in C++11, so std::move can be used).
Note that insert, pop and isEmpty are called about ten times more frequently than split.
I tried some coding with list and vector but with no success:
#include <vector>
#include <iostream>
// g++ -Wall -g -std=c++11
/*
output:
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
5 6 7 8 9
*/
int main ()
{
    std::vector<int> v1;
    for (int i = 0; i < 10; ++i) v1.push_back(i);
    for (auto i : v1) std::cout << i << " ";
    std::cout << std::endl;

    auto halfway = v1.begin() + v1.size() / 2;
    auto endItr = v1.end();

    std::vector<int> v2;
    v2.insert(v2.end(),
              std::make_move_iterator(halfway),
              std::make_move_iterator(endItr));

    // sigsegv
    /*
    auto halfway2 = v1.begin() + v1.size() / 2;
    auto endItr2 = v1.end();
    v2.erase(halfway2, endItr2);
    */

    for (auto i : v1) std::cout << i << " ";
    std::cout << std::endl;
    for (auto i : v2) std::cout << i << " ";
    std::cout << std::endl;
    return 0;
}
Any sample code, ideas, links or whatever useful? Thanks
Related literature:
How to move the later half of a vector into another vector? (actually does not work due to the deletion problem)
http://www.cplusplus.com/reference/iterator/move_iterator/
Your problems with the deletion are due to a bug in your code.
// sigsegv
auto halfway2 = v1.begin() + v1.size() / 2;
auto endItr2 = v1.end();
v2.erase(halfway2, endItr2);
You try to erase from v2 with iterators pointing into v1. That won't work, and you probably wanted to call erase on v1.
That fixes your deletion problem when splitting the vector, and vector seems to be the best container for what you want.
Note that everything except split can be done in O(1) on a vector if you insert at the end only; since order doesn't matter to you, I don't see any problem with that. split would be O(n) in your implementation once you fixed it, but that should still be pretty fast, since the data sits right next to each other in the vector, and that's very cache-friendly.
I can’t think of a solution with all operations in O(1).
With a list you can have push and pop in O(1), and split in O(n) (due to the fact that you need to find the middle of the list).
With a balanced binary tree (not a search tree) you can have all operations in O(log n).
edit
There have been some suggestions that keeping a pointer to the middle of the list would produce an O(1) split. This is not the case: after the split you have to compute the middle of the left list and the middle of the right list, which is O(n).
Some other suggestion is that a vector is preferred simply because it is cache-friendly. I totally agree with this.
For fun, I implemented a balanced binary tree container that performs all operations in O(log n). The insert and pop are obviously in O(log n). The actual split is in O(1), however we are left with the root node which we have to insert in one of the halves resulting in O(log n) for split also. No copying is involved however.
Here is my attempt at the said container (I haven't thoroughly tested it for correctness, and it can be further optimized, for example by transforming the recursion into a loop).
#include <memory>
#include <iostream>
#include <utility>
#include <stdexcept>

template <class T>
class BalancedBinaryTree {
private:
    class Node;
    std::unique_ptr<Node> root_;

public:
    void insert(const T &data) {
        if (!root_) {
            root_ = std::unique_ptr<Node>(new Node(data));
            return;
        }
        root_->insert(data);
    }

    std::size_t getSize() const {
        if (!root_) {
            return 0;
        }
        return 1 + root_->getLeftCount() + root_->getRightCount();
    }

    // Tree must not be empty!!
    T pop() {
        if (root_->isLeaf()) {
            T temp = root_->getData();
            root_ = nullptr;
            return temp;
        }
        return root_->pop()->getData();
    }

    BalancedBinaryTree split() {
        if (!root_) {
            return BalancedBinaryTree();
        }
        BalancedBinaryTree left_half;
        T root_data = root_->getData();
        bool left_is_bigger = root_->getLeftCount() > root_->getRightCount();
        left_half.root_ = std::move(root_->getLeftChild());
        root_ = std::move(root_->getRightChild());
        if (left_is_bigger) {
            insert(root_data);
        } else {
            left_half.insert(root_data);
        }
        return left_half;
    }
};

template <class T>
class BalancedBinaryTree<T>::Node {
private:
    T data_;
    std::unique_ptr<Node> left_child_, right_child_;
    std::size_t left_count_ = 0;
    std::size_t right_count_ = 0;

public:
    Node() = default;
    Node(const T &data, std::unique_ptr<Node> left_child = nullptr,
         std::unique_ptr<Node> right_child = nullptr)
        : data_(data), left_child_(std::move(left_child)),
          right_child_(std::move(right_child)) {
    }

    bool isLeaf() const {
        return left_count_ + right_count_ == 0;
    }

    const T& getData() const {
        return data_;
    }

    T& getData() {
        return data_;
    }

    std::size_t getLeftCount() const {
        return left_count_;
    }

    std::size_t getRightCount() const {
        return right_count_;
    }

    std::unique_ptr<Node> &getLeftChild() {
        return left_child_;
    }

    const std::unique_ptr<Node> &getLeftChild() const {
        return left_child_;
    }

    std::unique_ptr<Node> &getRightChild() {
        return right_child_;
    }

    const std::unique_ptr<Node> &getRightChild() const {
        return right_child_;
    }

    void insert(const T &data) {
        if (left_count_ <= right_count_) {
            ++left_count_;
            if (left_child_) {
                left_child_->insert(data);
            } else {
                left_child_ = std::unique_ptr<Node>(new Node(data));
            }
        } else {
            ++right_count_;
            if (right_child_) {
                right_child_->insert(data);
            } else {
                right_child_ = std::unique_ptr<Node>(new Node(data));
            }
        }
    }

    std::unique_ptr<Node> pop() {
        if (isLeaf()) {
            throw std::logic_error("pop invalid path");
        }
        if (left_count_ > right_count_) {
            --left_count_;
            if (left_child_->isLeaf()) {
                return std::move(left_child_);
            }
            return left_child_->pop();
        }
        --right_count_;
        if (right_child_->left_count_ == 0 && right_child_->right_count_ == 0) {
            return std::move(right_child_);
        }
        return right_child_->pop();
    }
};
usage:
BalancedBinaryTree<int> t;
BalancedBinaryTree<int> t2;
t.insert(3);
t.insert(7);
t.insert(17);
t.insert(37);
t.insert(1);
t2 = t.split();
while (t.getSize() != 0) {
    std::cout << t.pop() << " ";
}
std::cout << std::endl;
while (t2.getSize() != 0) {
    std::cout << t2.pop() << " ";
}
std::cout << std::endl;
output:
1 17
3 37 7
If the number of elements/bytes stored at any one time in your container is large, the solution of Youda008 (using a list and keeping track of the middle) may not be as efficient as you hope.
Alternatively, you could have a list<vector<T>> or even list<array<T,Capacity>> and keep track of the middle of the list, i.e. split only between two sub-containers, but never split a sub-container. This should give you both O(1) on all operations and reasonable cache efficiency. Use array<T,Capacity> if a single value for Capacity serves your needs at all times (for Capacity=1, this reverts to an ordinary list).
Otherwise, use vector<T> and adapt the capacity for new vectors according to demand.
bolov points out correctly that finding the middles of the lists emerging from splitting one list is not O(1). This implies that keeping track of the middle is not useful. However, using a list<sub_container> is still faster than a plain list, because the split only costs O(n/Capacity), not O(n). The price you pay for this is that the split has a granularity of Capacity rather than 1. Thus, you must compromise between the accuracy and the cost of a split.
Another option is to implement your own container using a linked list and a pointer to the middle element, at which you want to split it. This pointer would be updated on every modifying operation. This way you can achieve O(1) complexity on all operations.
I have a set of ranges :
Range1 ---- (0-10)
Range2 ---- (15-25)
Range3 ---- (100-1000), and so on.
I would like to store only the bounds, since storing every element of a large range would be inefficient.
Now I need to search for a number, say 14. In this case, 14 is not present in any of the ranges, whereas (say) 16 is present in one of them.
I would need a function
bool search(ranges, searchvalue)
{
    if (searchvalue is present in any of the ranges)
        return true;
    else
        return false;
}
How best can this be done? The ranges are strictly non-overlapping, and the important criterion is that the search must be as efficient as possible.
One possibility is to represent ranges as a pair of values and define a suitable comparison function. The following should consider one range less than another if its bounds are smaller and there is no overlap. As a side effect, this comparison function doesn't let you store overlapping ranges in the set.
To look up an integer n, it can be treated as the range [n, n]:
#include <set>
#include <utility>
#include <iostream>

typedef std::pair<int, int> Range;

struct RangeCompare
{
    // overlapping ranges are considered equivalent
    bool operator()(const Range& lhv, const Range& rhv) const
    {
        return lhv.second < rhv.first;
    }
};

bool in_range(const std::set<Range, RangeCompare>& ranges, int value)
{
    return ranges.find(Range(value, value)) != ranges.end();
}

int main()
{
    std::set<Range, RangeCompare> ranges;
    ranges.insert(Range(0, 10));
    ranges.insert(Range(15, 25));
    ranges.insert(Range(100, 1000));
    std::cout << in_range(ranges, 14) << ' ' << in_range(ranges, 16) << '\n';
}
The standard way to handle this is through so called interval trees. Basically, you augment an ordinary red-black tree with additional information so that each node x contains an interval x.int and the key of x is the low endpoint, x.int.low, of the interval. Each node x also contains a value x.max, which is the maximum value of any interval endpoint stored in the subtree rooted at x. Now you can determine x.max given interval x.int and the max values of node x’s children as follows:
x.max = max(x.int.high, x.left.max, x.right.max)
This implies that, with n intervals, insertion and deletion run in O(lg n) time. In fact, it is possible to update the max attributes after a rotation in O(1) time. Here is how to search for an element i in the interval tree T
INTERVAL-SEARCH(T, i)
x = T.root
while x is different from T.nil and i does not overlap x.int
if x.left is different from T.nil and x.left.max is greater than or equal to i.low
x = x.left
else
x = x.right
return x
The complexity of the search procedure is O(lg n) as well.
To see why, see CLRS Introduction to algorithms, chapter 14 (Augmenting Data Structures).
You could put something together based on std::map and std::map::upper_bound:
Assuming you have
std::map<int,int> ranges; // key is start of range, value is end of range
You could do the following:
bool search(const std::map<int,int>& ranges, int searchvalue)
{
    auto p = ranges.upper_bound(searchvalue);
    // p->first > searchvalue
    if (p == ranges.begin())
        return false;
    --p; // p->first <= searchvalue
    return searchvalue >= p->first && searchvalue <= p->second;
}
I'm using C++11; if you use C++03, you'll need to replace auto with the proper iterator type.
EDIT: replaced pseudo-code inrange() by explicit expression in return statement.
A good solution could be the following; it is O(log(n)).
A critical precondition is that the ranges are non-overlapping.
#include <set>
#include <iostream>
#include <assert.h>

template <typename T> struct z_range
{
    T s, e;
    z_range(T const& s, T const& e) : s(s <= e ? s : e), e(s <= e ? e : s)
    {
    }
};

template <typename T> bool operator<(z_range<T> const& x, z_range<T> const& y)
{
    if (x.e < y.s)
        return true;
    return false;
}

int main(int, char*[])
{
    std::set<z_range<int>> x;
    x.insert(z_range<int>(20, 10));
    x.insert(z_range<int>(30, 40));
    x.insert(z_range<int>(5, 9));
    x.insert(z_range<int>(45, 55));

    if (x.find(z_range<int>(15, 15)) != x.end())
        std::cout << "I have it" << std::endl;
    else
        std::cout << "not exists" << std::endl;
}
Suppose you have ranges ri = [ai, bi]. You could sort the ranges by their lower bounds, put the ai into an array, and use binary search to find the largest ai with ai <= x.
After you have found this element, you only have to check whether x <= bi.
This is suitable if you have big numbers. If, on the other hand, you have a lot of memory or small numbers, you can put the ranges into a boolean array. This may be worthwhile if you have a lot of queries:

std::vector<bool> ar(max_value + 1, false);         // max_value: largest possible bound
std::fill(ar.begin(),      ar.begin() + 11, true);  // range [0, 10]
std::fill(ar.begin() + 15, ar.begin() + 26, true);  // range [15, 25]
// ...
bool check(int searchValue) {
    return ar[searchValue];
}
Since the ranges are non-overlapping, the only thing left to do is to perform a search within the range that fits the value. If the values are ordered within the ranges, searching is even simpler. Here is a summary of search algorithms.
With respect to C++ you also can use algorithms from STL or even functions provided by the containers, e. g. set::find.
So, this assumes the ranges are continuous (i.e. the range [100,1000] contains all numbers between 100 and 1000):
#include <iostream>
#include <map>
#include <algorithm>
bool is_in_ranges(std::map<int, int> ranges, int value)
{
    return
        std::find_if(ranges.begin(), ranges.end(),
                     [&](std::pair<int,int> pair)
                     {
                         return value >= pair.first && value <= pair.second;
                     }) != ranges.end();
}
int main()
{
std::map<int, int> ranges;
ranges[0] = 10;
ranges[15] = 25;
ranges[100] = 1000;
std::cout << is_in_ranges(ranges, 14) << '\n'; // 0
std::cout << is_in_ranges(ranges, 16) << '\n'; // 1
}
In C++03, you'd need a functor instead of a lambda function:
struct is_in {
    is_in(int x) : value(x) {}
    bool operator()(std::pair<int, int> pair)
    {
        return value >= pair.first && value <= pair.second;
    }
private:
    int value;
};

bool is_in_ranges(std::map<int, int> ranges, int value)
{
    return
        std::find_if(ranges.begin(), ranges.end(), is_in(value)) != ranges.end();
}