I have a std::map object map<string, Property*> _propertyMap, where string is the property's name and Property* contains the property values.
I need to process the property values and convert them to a specific data format; each property has its own format. For example, if the map is initialized as follows:
_propertyMap["id"] = new Property(Property::UUID, "12345678");
_propertyMap["name"] = new Property(Property::STRING, "name");
....
then "id" should be processed differently than "name" etc.
This means that I need to look for each property in the map and process its values accordingly.
I thought about two ways to do that.
One: use the std::map::find method to get each specific property, like this:
map<string , Property*>::iterator it1 = _propertyMap.find("id");
if(it1 != _propertyMap.end())
{
//element found - process id values
}
map<string , Property*>::iterator it2 = _propertyMap.find("name");
if(it2 != _propertyMap.end())
{
//element found - process name values
}
....
Two: iterate over the map and, for each entry, check what the property's name is and proceed accordingly:
for (it = _propertyMap.begin(); it != _propertyMap.end(); ++it )
{
//if it is events - append the values to the matching nodes
if (it->first == "id")
{
//process id values
}
else if (it->first == "name")
{
//process name values
}
.....
}
Given that the time complexity of std::map::find is O(log N), the complexity of the first solution is O(N log N) for N lookups. I'm not sure about the complexity of the second solution: it iterates over the map once (O(N)) but performs a chain of if-else comparisons on each iteration. I tried to google common map::find() questions but couldn't find any useful information; most of them just need to get one value from the map, and for that find() has the better complexity (O(log N) vs. O(N)).
What is a better approach? or perhaps there is another one which I didn't think of?
Also, style-wise, which one is cleaner and clearer code?
I see a few different use-cases here, depending on what you have in mind:
Fixed properties
(Just for completeness; I guess this is not what you want.) If both the names and types of the possible properties are fixed, the best option is a simple class/struct, possibly using boost::optional (std::optional since C++17) for values that might or might not be present:
struct Data{
int id = 0;
std::string name = "";
boost::optional<int> whatever = boost::none;
}
Pros:
All "lookups" are resolved at compile-time
Cons:
No flexibility to expand at runtime
Process only specific options depending on their key
If you want to process only a specific subset of the options, but keep the possibility of having (unprocessed) custom keys, your approaches seem suitable.
In this case remember that using find like this:
it1 = _propertyMap.find("id");
has complexity O(log N), but it is used M times, with M being the number of processed options. M is not the size of your map; it is the number of times you use find() to get a specific property. In your (shortened) example this means a complexity of O(2 log N), since you only look for 2 keys.
So basically, using find() M times scales better than looping when only the size of the map increases, but worse if the number of finds grows in the same manner. Only profiling can tell you which one is faster for your sizes and use case.
Process all options depending on type
Since your map looks a lot like the keys can be custom but the types come from a small subset, consider looping over the map and using the types instead of the names to determine how to process the entries. Something like this:
for (it = _propertyMap.begin(); it != _propertyMap.end(); ++it )
{
//note: the type lives in the mapped value (it->second), not in the key;
//the exact accessor depends on how Property exposes its type
if (it->second->type == Property::UUID)
{
//process UUID values
}
else if (it->second->type == Property::STRING)
{
//process STRING values
}
.....
}
This has the advantage that you do not need any information about what the keys of your map really are, only what types it is able to store.
Suppose we have a map of N properties and we are looking for a subset of P properties. Here is a rough analysis, not knowing the statistical distribution of the keys:
In the pure map approach you search P times with a complexity of O(log N) each, that is O(P log N) overall.
In the chained-if approach you traverse the map once. That's O(N). But you should not forget that an if-else chain is also a (hidden) traversal of a list of P elements. So for every one of the N elements you are doing a search over potentially up to P candidates, which gives a complexity of O(P*N).
This means that the map approach will outperform your traversal, and the performance gap will increase significantly with N. Of course this doesn't take into account the function-call overhead in map that you don't have in the if-chain, so if P and N are small, your approach could still stand up to the theoretical comparison.
What you could eventually do to increase performance further would be to use an unordered_map, whose lookups are O(1) on average, reducing your problem's complexity to an expected O(P).
There is another option which combines the best of both. Given a function like this (which is an adaptation of std::set_intersection):
template<class InputIt1, class InputIt2,
class Function, class Compare>
void match(InputIt1 first1, InputIt1 last1,
InputIt2 first2, InputIt2 last2,
Function f, Compare comp)
{
while (first1 != last1 && first2 != last2) {
if (comp(*first1,*first2)) {
++first1;
} else {
if (!comp(*first2,*first1)) {
f(*first1++,*first2);
}
++first2;
}
}
}
You can use it to process all your properties in O(N+M) time. Here is an example:
#include <map>
#include <string>
#include <functional>
#include <cassert>
using std::map;
using std::string;
using std::function;
struct Property {
enum Type { UUID, STRING };
Type type;
string value;
};
int main()
{
map<string,Property> properties;
map<string,function<void(Property&)>> processors;
properties["id"] = Property{Property::UUID,"12345678"};
properties["name"] = Property{Property::STRING,"name"};
bool id_found = false;
bool name_found = false;
processors["id"] = [&](Property&){ id_found = true; };
processors["name"] = [&](Property&){ name_found = true; };
match(
properties.begin(),properties.end(),
processors.begin(),processors.end(),
[](auto &a,auto &b){ b.second(a.second); },
[](auto &a,auto &b) { return a.first < b.first; }
);
assert(id_found && name_found);
}
The processors map can be built separately and reused to reduce the overhead.
Related
Does C++ have an analog of IDictionary.ContainsKey(TKey) or List.Contains(T) from C#?
For example, I have a collection of elements and need to know whether it contains some value or not.
What is the best way or best practice, without a "foreach" over every element?
It would be good if it came from the standard library.
UPD 1: The standard library has many containers, but I want to find the best way: fast, little code, not complicated, and so on.
It looks like the best choice is std::unordered_set, if going by this logic:
#include <unordered_set>
std::unordered_set<std::string> NamesOfValues = {
"one",
"two",
"Date",
"Time"
};
// and now check is value exists in set
if(NamesOfValues.count(value))
{
// value exists
}
It seems most concise to use count, and this works for any of the standard associative containers.
if ( my_map.count(key) ) { // shorthand for `count(key) != 0`
// It exists
} else {
// It does not
}
If we're talking about [unordered_]map and [unordered_]set, which are closest to your original dictionary type, then these containers enforce unique keys, so the returned .count() can only be 0 or 1, and there's no need to worry about the code pointlessly iterating the rest of the container once it finds a match (as would occur for containers supporting duplicates).
Either way, simply using implicit conversion to bool leads to the most concise code. And if you end up having a design that might allow/need different counts per key, then you could compare against a specific value.
You are looking for std::find (or rather std::find_if here). std::find searches an arbitrary iterable range for a value and returns an iterator to the first matching element. For a map the elements are key/value pairs, so to search by key with the generic algorithms you need std::find_if with a predicate:
std::unordered_map<char,int> my_map = { {'a', 1}, {'b', 2} };
auto found_element = std::find_if(my_map.begin(), my_map.end(),
                                  [](const auto& p){ return p.first == 'a'; });
if( found_element == my_map.end() ){
//nothing was found
}
else{
// do something
}
For maps you also have the member function my_map.find(key), which is O(log N) for std::map (and O(1) on average for unordered_map) instead of the O(N) linear search:
if( my_map.find('a') != my_map.end() ){
//something was found!
}
else{
//nothing was found
}
This is clearer than my_map.count(); you would only use count() if you actually needed to know how many elements you have, with non-unique keys (multimap).
I have a list of file names, with each representing a point in time. The list typically has thousands of elements. Given a time point, I'd like to convert these files names into time objects (I'm using boost::ptime), and then find the value of std::lower_bound of this time point with respect to the files names.
Example:
Filenames (with date + time, minutes increasing, with a minute for every file):
station01_20170612_030405.hdf5
station01_20170612_030505.hdf5
station01_20170612_030605.hdf5
station01_20170612_030705.hdf5
station01_20170612_030805.hdf5
station01_20170612_030905.hdf5
If I have a time-point 2017-06-12 03:06:00, then it fits here:
station01_20170612_030405.hdf5
station01_20170612_030505.hdf5
<--- The lower bound I am looking for is here
station01_20170612_030605.hdf5
station01_20170612_030705.hdf5
station01_20170612_030805.hdf5
station01_20170612_030905.hdf5
So far, everything is simple. Now the problem is that the list of files may be polluted with invalid file names, which make the conversion to a time point fail.
Currently I'm doing this the easy/inefficient way, and I'd like to optimize it because this program will run on a server, where the cost of operation matters. So, the naive way is: create a new list of time points, and push only the time points that are valid:
vector<ptime> filesListTimePoints;
filesListTimePoints.reserve(filesList.size());
ptime time;
for(size_t i = 0; i < filesList.size(); i++) {
ErrorCode error = ConvertToTime(filesList[i], time);
if(error.errorCode() == SUCCESS)
filesListTimePoints.push_back(time);
}
//now use std::lower_bound() on filesListTimePoints
You see, the problem is that I'm using a linear solution with a problem that can be solved with O(log(N)) complexity. I don't need to convert all files or even look at all of them!
My question: How can I embed this into std::lower_bound, such that it remains with optimal complexity?
My idea of a possible solution:
On cppreference, there's a basic implementation of std::lower_bound. I'm thinking of modifying that to get a working solution. But I'm not sure what to do when a conversion fails, since this algorithm depends heavily on monotonic behavior. Does this problem have a solution at all, even mathematically speaking?
Here's the version I'm thinking about initially:
template<class ForwardIt, class T>
ForwardIt lower_bound(ForwardIt first, ForwardIt last, const T& value)
{
ForwardIt it;
typename std::iterator_traits<ForwardIt>::difference_type count, step;
count = std::distance(first, last);
T time; //holds the converted time of the probed element
while (count > 0) {
it = first;
step = count / 2;
std::advance(it, step);
ErrorCode error = ConvertToTime(*it, time);
if(error.errorCode() == SUCCESS)
{
if (time < value) { //compare the converted time, not the file name
first = ++it;
count -= step + 1;
}
else
count = step;
}
else {
// skip/ignore this point? but then the invariant of the
// binary search is broken; this is exactly the problem
}
}
return first;
}
My ultimate solution (which might sound stupid) is to make this method a mutator of the list, and erase elements that are invalid. Is there a cleaner solution?
You can simply index by optional<ptime>. If you want to cache the converted values, consider making it a multimap<optional<ptime>, File>.
Better yet, make a datatype representing the file, and calculate the timepoint inside its constructor:
struct File {
File(std::string fname) : _fname(std::move(fname)), _time(parse_time(_fname)) { }
boost::optional<boost::posix_time::ptime> _time;
std::string _fname;
static boost::optional<boost::posix_time::ptime> parse_time(std::string const& fname) {
// return ptime or boost::none
}
};
Now, simply define operator< suitably or use e.g. boost::multi_index_container to index by _time
Further notes:
in case it wasn't clear, such a map/set will have its own lower_bound, upper_bound and equal_range operations, and will obviously also work with std::lower_bound and friends.
there's always filter_iterator adaptor: http://www.boost.org/doc/libs/1_64_0/libs/iterator/doc/filter_iterator.html
My question is the following:
After using find on a std::map to get an iterator pointed to the desired element pair, is it possible to reuse that iterator on subsequent find()'s to take advantage of knowing that the elements im looking for afterwards are close to the first found element? Something like:
std::map<key, value> map_elements;
std::map<key, value>::iterator it;
it = map_elements.find(some_key);
it = it.find(a_close_key)
Thank you in advance
If you're sure it's really nearby, you could use std::find (instead of map::find) to do a linear search for the item. If it's within approximately log(N) items of the current position, this is likely to be a win (where N is the number of items in the map).
Also note that you'll have to figure out whether you want to search before or after the current position, and pass current, end() if it's after, or begin(), current if it's before. If it's before, you'll want to search backwards (e.g. with reverse iterators), since the target is presumably close to the end of that range.
Your question doesn't say how far Item1 (found by map::find) can be from Item2. In some cases it's more efficient to do a fresh map::find; in others you can just advance your iterator to find where the second item is. A fresh map::find is O(log N), which can be around 10-20 steps.
So, if you know Item2 is not far away, you can just advance the iterator to look for it. The most important thing here is knowing when to stop the search. std::map uses std::less<T> by default to order items, so that ordering can also tell you when the container cannot contain Item2 at all. Something like this (not tested):
std::map<key, value> map_elements;
std::map<key, value>::iterator it, it2;
it2 = it = map_elements.find(some_key);
bool found=false;
while( it2!=map_elements.end() && !(a_close_key < it2->first) ) {
if( !(a_close_key < it2->first) && !(it2->first < a_close_key) ) {
//Equivalency is not ==, but its what used in std::map
found=true;
break;
}
it2++;
}
if( found ) {
//.... use it2
}
Inside the if( found ) block, the iterator it2 should hold the same value as if you had called map_elements.lower_bound(a_close_key).
I have a map which I have declared as follows:
map<int, bool> index;
and I insert values into the map as:
int x; cin>>x;
index[x]=true;
However,
cout<<index[y]; // prints 0 for any number y not in index
Since I get the value 0 even when I check a key that is not present in the map, how can I reliably find out whether a key is present or not?
I'm using a map for trying to find out if two sets are disjoint or not, and for the same I am using a map, and two vectors to store the input. Is this shabby in any way? Some other data structure I should be using?
You can use if (index.find(key) == index.end()) to determine whether a key is present. Using index[key] you default-construct a new value (in this case bool(), which gets printed as 0). The newly constructed value also gets inserted into the map (i.e. index[key] is in this case equivalent to index.insert(std::make_pair(key, bool())).first->second).
Using two data structures for the same data is ok. However, is there any need for a map; wouldn't a set suffice in your use case? I.e. the key being present means true, and absent means false.
To find whether two sets (given as std::set) are disjoint, you can simply compute their intersection:
std::set<T> X, Y; // populate
std::set<T> I;
std::set_intersection(X.begin(), X.end(), Y.begin(), Y.end(), std::back_inserter(I));
const bool disjoint = I.empty();
If your containers aren't std::sets, you have to make sure the ranges are sorted.
If you want to be more efficient, you can implement the algorithm for set_intersection and stop once you have a common element:
template <typename Iter1, typename Iter2>
bool disjoint(Iter1 first1, Iter1 last1, Iter2 first2, Iter2 last2)
{
while (first1 != last1 && first2 != last2)
{
if (*first1 < *first2) ++first1;
else if (*first2 < *first1) ++first2;
else { return false; }
}
return true;
}
Use map::find.
You can use index.find(key) != index.end() or index.count(key) > 0.
Depending on the range of index values, it might be good to use a bitmap (this only makes sense for a reasonably small range of possible values, and it makes the disjointness check very easy and efficient), or to use a set instead of a map (a map stores additional bools that are not really needed). A set also offers the methods count(key) and find(key).
1. Use index.count(y). It's more concise than, and equivalent to, index.find(y) != index.end(), except that it gives you an integer 1 or 0, whereas != of course gives you a bool.
The downside is that count is potentially less efficient for multimap than for map, since it may have to count more than one entry. Since you aren't using a multimap, no problem.
2. You could sort both vectors and use std::set_intersection, but that's not a perfect fit if all you care about is whether the intersection is empty. Depending on where the input comes from, you may be able to get rid of both vectors and just build a map as you go from the first load of input, then check each element of the second load of input against it. Finally, use a set instead of a map.
Profiling my cpu-bound code has suggested I that spend a long time checking to see if a container contains completely unique elements. Assuming that I have some large container of unsorted elements (with < and = defined), I have two ideas on how this might be done:
The first using a set:
template <class T>
bool is_unique(vector<T> X) {
set<T> Y(X.begin(), X.end());
return X.size() == Y.size();
}
The second looping over the elements:
template <class T>
bool is_unique2(vector<T> X) {
typename vector<T>::iterator i,j;
for(i=X.begin();i!=X.end();++i) {
for(j=i+1;j!=X.end();++j) {
if(*i == *j) return 0;
}
}
return 1;
}
I've tested them as best I can, and from what I can gather from reading the documentation about the STL, the answer is (as usual): it depends. I think that in the first case, if all the elements are unique, it is very quick, but if there is a large degeneracy the operation seems to take O(N^2) time. For the nested-iterator approach the opposite seems to be true: it is lightning fast if X[0]==X[1], but takes (understandably) O(N^2) time if all the elements are unique.
Is there a better way to do this, perhaps an STL algorithm built for this very purpose? If not, are there any suggestions to eke out a bit more efficiency?
Your first example should be O(N log N), as set takes O(log N) time for each insertion. I don't think a better big-O is possible.
The second example is obviously O(N^2). The coefficient and memory usage are low, so it might be faster (or even the fastest) in some cases.
It depends what T is, but for generic performance, I'd recommend sorting a vector of pointers to the objects.
template< class T >
bool dereference_less( T const *l, T const *r )
{ return *l < *r; }
template <class T>
bool is_unique(vector<T> const &x) {
vector< T const * > vp;
vp.reserve( x.size() );
for ( size_t i = 0; i < x.size(); ++ i ) vp.push_back( &x[i] );
sort( vp.begin(), vp.end(), ptr_fun( &dereference_less<T> ) ); // O(N log N)
return adjacent_find( vp.begin(), vp.end(),
not2( ptr_fun( &dereference_less<T> ) ) ) // "opposite functor"
== vp.end(); // if no adjacent pair (vp_n,vp_n+1) has *vp_n < *vp_n+1
}
or in STL style,
template <class I>
bool is_unique(I first, I last) {
typedef typename iterator_traits<I>::value_type T;
…
And if you can reorder the original vector, of course,
template <class T>
bool is_unique(vector<T> &x) {
sort( x.begin(), x.end() ); // O(N log N)
return adjacent_find( x.begin(), x.end() ) == x.end();
}
You must sort the vector if you want to quickly determine if it has only unique elements. Otherwise the best you can do is O(n^2) runtime or O(n log n) runtime with O(n) space. I think it's best to write a function that assumes the input is sorted.
template<class Fwd>
bool is_unique(Fwd first, Fwd last)
{
return adjacent_find(first, last) == last;
}
then have the client sort the vector, or make a sorted copy of the vector. This opens the door for dynamic programming: if the client sorted the vector in the past, they have the option to keep and refer to that sorted vector, so they can repeat this operation in O(n) time.
The standard library has std::unique, but that would require you to make a copy of the entire container (note that in both of your examples you make a copy of the entire vector as well, since you unnecessarily pass the vector by value).
template <typename T>
bool is_unique(std::vector<T> vec)
{
std::sort(vec.begin(), vec.end());
return std::unique(vec.begin(), vec.end()) == vec.end();
}
Whether this would be faster than using a std::set would, as you know, depend :-).
Is it infeasible to just use a container that provides this "guarantee" from the get-go? Would it be useful to flag a duplicate at the time of insertion rather than at some point in the future? When I've wanted to do something like this, that's the direction I've gone; just using the set as the "primary" container, and maybe building a parallel vector if I needed to maintain the original order, but of course that makes some assumptions about memory and CPU availability...
For one thing you could combine the advantages of both: stop building the set, if you have already discovered a duplicate:
template <class T>
bool is_unique(const std::vector<T>& vec)
{
std::set<T> test;
for (typename std::vector<T>::const_iterator it = vec.begin(); it != vec.end(); ++it) {
if (!test.insert(*it).second) {
return false;
}
}
return true;
}
BTW, Potatoswatter makes a good point that in the generic case you might want to avoid copying T, in which case you might use a std::set<const T*, dereference_less> instead.
You could of course potentially do much better if it wasn't generic. E.g if you had a vector of integers of known range, you could just mark in an array (or even bitset) if an element exists.
You can use std::unique, but it requires the range to be sorted first:
template <class T>
bool is_unique(vector<T> X) {
std::sort(X.begin(), X.end());
return std::unique(X.begin(), X.end()) == X.end();
}
std::unique modifies the sequence and returns an iterator to the end of the unique set, so if that's still the end of the vector then it must be unique.
This runs in O(n log n), the same as your set example. I don't think you can theoretically guarantee to do it faster, although using a C++0x std::unordered_set instead of std::set would do it in expected linear time. But that requires your elements to be hashable as well as having operator== defined, which might not be so easy.
Also, if you're not modifying the vector in your examples, you'd improve performance by passing it by const reference, so you don't make an unnecessary copy of it.
If I may add my own 2 cents.
First of all, as #Potatoswatter remarked, unless your elements are cheap to copy (built-in/small PODs) you'll want to use pointers to the original elements rather than copying them.
Second, there are 2 strategies available.
Simply ensure there is no duplicate inserted in the first place. This means, of course, controlling the insertion, which is generally achieved by creating a dedicated class (with the vector as attribute).
Whenever the property is needed, check for duplicates
I must admit I would lean toward the first. Encapsulation, clear separation of responsibilities and all that.
Anyway, there are a number of ways depending on the requirements. The first question is:
do we have to let the elements in the vector in a particular order or can we "mess" with them ?
If we can mess with them, I would suggest keeping the vector sorted: Loki::AssocVector should get you started.
If not, then we need to keep an index on the structure to ensure this property... wait a minute: Boost.MultiIndex to the rescue ?
Thirdly: as you remarked yourself, a simple linear search yields O(N^2) complexity on average, which is no good.
If < is already defined, then sorting is the obvious route, with its O(N log N) complexity.
It might also be worth making T hashable, because a std::tr1::unordered_set could yield a better time (I know, you need a RandomAccessIterator, but if T is hashable then it's easy to make T* hashable too ;) )
But in the end the real issue here is that our advice is necessarily generic, because we lack data.
What is T? Do you intend the algorithm to be generic?
What is the number of elements? 10, 100, 10,000, 1,000,000? Asymptotic complexity is kind of moot when dealing with a few hundred.
And of course: can you ensure uniqueness at insertion time? Can you modify the vector itself?
Well, your first one should only take O(N log N), so it's clearly the better worst-case scenario for this application.
However, you should be able to get a better best case if you check as you add things to the set:
template <class T>
bool is_unique3(vector<T> X) {
set<T> Y;
typename vector<T>::const_iterator i;
for(i=X.begin(); i!=X.end(); ++i) {
if (Y.find(*i) != Y.end()) {
return false;
}
Y.insert(*i);
}
return true;
}
This should have O(1) best case, O(N log(N)) worst case, and average case depends on the distribution of the inputs.
If the type T you store in your vector is large and copying it is costly, consider creating a vector of pointers or iterators to your vector's elements. Sort it based on the elements pointed to, and then check for uniqueness.
You can also use std::set for that. The template looks like this:
template <class Key, class Traits = less<Key>, class Allocator = allocator<Key> > class set
I think you can provide an appropriate Traits parameter and insert raw pointers for speed, or implement a simple wrapper class for pointers with an operator<.
Don't use the constructor for inserting into the set; use the insert method. That method (one of its overloads) has the signature
pair<iterator, bool> insert(const value_type& _Val);
By checking the result (the second member) you can often detect a duplicate much sooner than if you had inserted all the elements.
In the (very) special case of checking discrete values with a known, not-too-big maximum value N,
you should be able to do a bucket count and simply check that no bucket ends up with 2 or more values:
bool is_unique(const vector<int>& X, int N)
{
vector<int> buckets(N, 0);
vector<int>::const_iterator i;
for(i = X.begin(); i != X.end(); ++i)
if(++buckets[*i] > 1)
return false;
return true;
}
The complexity of this would be O(n).
Using the current C++ standard containers, you have a good solution in your first example. But if you can use a hash container, you might be able to do better: a hash set gives expected O(N) overall instead of O(N log N) for a standard set. Of course everything will depend on the size of N and your particular library implementation.