I'm working with data in which every element is distinct from every other element of the same type. Very abstractly, a set fits the definition of the data I'm working with. I feel inclined to use std::unordered_set instead of std::vector for that reason.
Beyond that, both classes can fit my requirements. My question is about performance -- which might perform better? I cannot write out the code one way and benchmark it, then rewrite it the other way. That would take me hundreds of hours. If they'll perform similarly, do you think it would be worthwhile to stick with the idiomatic unordered_set?
Here is a simpler use case. A company is selling computers. Each is unique from another in at least one way, guaranteed.
struct computer_t
{
    std::string serial;
    std::uint32_t gb_of_ram;
};
std::unordered_set<computer_t> all_computers_in_existence;
std::unordered_set<computer_t> computers_for_sale; // subset of above
// alternatively
std::vector<computer_t> all_computers_in_existence;
std::vector<computer_t> computers_for_sale; // subset of above
The company wants to stop selling computers that aren't popular and replace them with other computers that might be.
std::unordered_set<computer_t> computers_not_for_sale;
std::set_difference(all_computers_in_existence.begin(), all_computers_in_existence.end(),
computers_for_sale.begin(), computers_for_sale.end(),
std::inserter(computers_not_for_sale, computers_not_for_sale.end()));
calculate_and_remove_least_sold(computers_for_sale);
calculate_and_add_most_likely_to_sell(computers_for_sale, computers_not_for_sale);
Based on the above sample code, what should I choose? Or is there another, new STL feature (in C++17) I should investigate? This really is as generic as it gets for my use-case without making this post incredibly long with details.
Idiomatic should be your first choice. If you implement it using unordered_set and the performance is not good enough, there are faster non-STL hash tables which are easy to switch to. 99% of the time it won't come to that.
Your example code using std::set_difference will not work, because that requires the inputs be sorted, which unordered_set is not. That's OK though, subtracting is done easily using unordered_set::erase(key).
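The erase-based subtraction mentioned above could look roughly like this. It is a minimal sketch, not the asker's real code: the equality operator, the computer_hash functor, the computer_set alias and the not_for_sale helper are all hypothetical, and it assumes the serial number alone identifies a computer.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_set>

struct computer_t
{
    std::string serial;
    std::uint32_t gb_of_ram;
};

// Assumption: two computers are the same if their serials match.
inline bool operator==(const computer_t& a, const computer_t& b)
{
    return a.serial == b.serial;
}

// Hypothetical hash functor; unordered_set needs one for a user-defined key.
struct computer_hash
{
    std::size_t operator()(const computer_t& c) const
    {
        return std::hash<std::string>{}(c.serial);
    }
};

using computer_set = std::unordered_set<computer_t, computer_hash>;

// Returns every element of `all` that is not in `for_sale`.
computer_set not_for_sale(const computer_set& all, const computer_set& for_sale)
{
    computer_set result = all;      // start from a copy of the full set
    for (const computer_t& c : for_sale)
        result.erase(c);            // erase by key; does nothing if c is absent
    return result;
}
```

Each erase is an average constant-time lookup, so the whole subtraction is roughly linear in the size of the two sets, and no sorting is needed.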
I understand that question sounds weird, so here is a bit of context.
Recently I was disappointed to learn that map reduce in C++20 ranges does not work as one would expect i.e.
const double val = data | transform(...) | accumulate (...);
does not work, you must write it this unnatural way:
const double val = accumulate(data | transform(...));
Details can be found here and here, but it boils down to the fact that accumulate cannot disambiguate between two different use cases.
So this got me thinking:
If C++20 required that you must use the pipe to use ranges, i.e. you cannot write
vector<int> v;
sort(v);
but you must write
vector<int> v;
v|sort();
would that solve the problem of ambiguity?
And if so, although it would feel unnatural to people used to std::sort and other STL algorithms, I wonder whether in the long run that would be a better design choice.
Note:
If this question is too vague feel free to vote to close, but I feel that this is a legitimate design question that can be answered in relatively unbiased way, especially if my understanding of the problem is wrong.
You need to differentiate between range algorithms and range adaptors. Algorithms are functions that perform a generic operation on a range of values. Adaptors are functions which create range views that modify the presentation of a range. Adaptors are chained by the | operator; algorithms are just regular functions.
Sometimes, the same conceptual thing can have both an algorithm and an adaptor form. transform exists as both an algorithm and an adaptor. The former stores the transformation into an output range; the latter creates a view range of the input that lazily computes the transformation as requested.
These are different tasks for different needs and uses.
Also, note that there is no sort adaptor in C++20. A sort adaptor would have to create a view range that somehow mixed around the elements in the source range. It would have to allocate storage for the new sequence of values (even if it's just sorting iterators/pointers/indices to the values). And the sorting would have to be done at construction time, so there would be no lazy operation taking place.
This is also why accumulate doesn't work that way. It's not a matter of "ambiguity"; it's a matter of the fundamental nature of the operation. Accumulation computes a value from a range; it does not compute a new range from an existing one. That's the work of an algorithm, not an adaptor.
Some tasks are useful in algorithm form. Some tasks are useful in adaptor form (you find very few zip-like algorithms). Some tasks are useful in both. But because these are two separate concepts for different purposes, they have different ways of invoking them.
would that solve the problem of ambiguity?
Yes.
If there's only one way to write something, that one way must be the only possible interpretation. If an algorithm "call" can only ever be a partial call to the algorithm that must be completed with a | operation with a range on the left hand side, then you'd never even have the question of if the algorithm call is partial or total. It's just always partial.
No ambiguity in that sense.
But if you went that route, you would end up with things like:
auto sum = accumulate("hello"s);
This doesn't actually sum the chars in that string; it is a placeholder, waiting on a range to accumulate over, with the initial value "hello"s.
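Here is a toy version of such a pipe-only accumulate. None of this is a standard facility: the pipe_only namespace, accumulate_closure and the operator| overload are hypothetical and exist only to show the "always partial" behaviour.

```cpp
#include <iterator>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

namespace pipe_only {

// Holds the initial value until a range is piped in.
template <typename T>
struct accumulate_closure
{
    T init;
};

// Calling accumulate() does no work; it only captures the initial value.
template <typename T>
accumulate_closure<T> accumulate(T init)
{
    return {std::move(init)};
}

// The accumulation itself runs only here, when a range appears on the left.
template <typename Range, typename T>
T operator|(const Range& range, accumulate_closure<T> closure)
{
    return std::accumulate(std::begin(range), std::end(range),
                           std::move(closure.init));
}

} // namespace pipe_only
```

With this design, pipe_only::accumulate(std::string("hello")) on its own just stores "hello". Only something like words | pipe_only::accumulate(std::string("hello")) folds the elements of words onto that initial value.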
I'm trying to create a general memoizer for multiple, arbitrary functions.
For each function std::function<ReturnType(Args...)> that we want to memoize, we keep an unordered_map<Args ..., ReturnType> (I'm keeping things simple on purpose).
The big problem comes when our memoized function has some really big argument in Args ...: for example, suppose that our function sorts a vector of 10 million numbers and then returns the sorted vector, so something like std::function<vector<double>(vector<double>)>.
As you can imagine, after inserting fewer than 100 vectors, we have already filled 8 GB of memory. Note that this may be caused by the combination of huge vectors and the memory required by the sorting algorithm (I didn't investigate the causes).
So what if, instead of the structure described above, we define unordered_map<UUID(Args ...), ReturnType> (where UUID = Universally Unique Identifier)? We would have to relax determinism (so we might return a wrong result), but only with a very low probability.
The problem is that, since I have never used UUIDs, I don't know if there are suitable implementations for this application.
So my question is:
Is there a better solution than UUIDs for this problem?
Which UUID implementation is best suited to this problem?
Is Boost.Uuid a possible candidate?
Unfortunately, this could solve the problem for Args ... but not for ReturnType, so is there a solution for the memoized result?
Note that the UUID generated for an object x should be the same across different runs and machines.
Note that if two different objects get the same UUID (so we return the wrong value) with a really low probability, that could be acceptable. Let's call this a "probabilistic memoizer".
I know that this application doesn't make sense in a memoization context (what are the odds that a user asks twice to sort the same 10-million-element vector?), but it is expensive in time and memory (so good for benchmarking and for introducing the memory problem stated above), so please don't crucify me because this is an absurd memoization application.
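For concreteness, the digest-keyed cache described above could be sketched as follows. This is only an illustration: digest and digest_memo are hypothetical names, and a 64-bit FNV-1a hash over the raw bytes stands in for the "UUID". It assumes every machine involved uses the same byte representation for double. A 64-bit non-cryptographic hash collides far more easily than a real 128-bit or cryptographic digest would.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

// 64-bit FNV-1a over the raw bytes of each element.
// Assumption: identical byte layout for double on all machines involved.
std::uint64_t digest(const std::vector<double>& v)
{
    std::uint64_t h = 14695981039346656037ULL;   // FNV-1a offset basis
    for (double d : v) {
        unsigned char bytes[sizeof d];
        std::memcpy(bytes, &d, sizeof d);
        for (unsigned char b : bytes) {
            h ^= b;
            h *= 1099511628211ULL;               // FNV-1a prime
        }
    }
    return h;
}

// Hypothetical memoizer that stores only the digest of the argument,
// never the argument itself. Two arguments with equal digests share a result.
template <typename Result>
class digest_memo
{
public:
    template <typename F>
    const Result& get(const std::vector<double>& arg, F&& compute)
    {
        const std::uint64_t key = digest(arg);
        auto it = cache_.find(key);
        if (it == cache_.end())
            it = cache_.emplace(key, compute(arg)).first;  // miss: compute once
        return it->second;
    }

    std::size_t size() const { return cache_.size(); }

private:
    std::unordered_map<std::uint64_t, Result> cache_;
};
```

This shrinks each key to 8 bytes, but the stored results are still full size, which is the ReturnType problem raised above.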
Identifying any object is easy. The address is "object identity" in C++. This is also the reason that even empty classes cannot have zero size.
Now, what you want is value equivalence. That's strictly not in the language domain. It's solidly in the application/library logic domain.
You should consider using something like Boost.Flyweight (boost::flyweight). It has precisely this facility, and makes it "easy" to customize the equivalence semantics for your types.
As I try to modernize my C++ skills, I keep encountering this situation where "the STL way" isn't obvious to me.
I have an object that wants to gather contributions from multiple sources into a container (typically a std::vector). Each source is an object, and each of those objects provides a method get_contributions() that returns any number of contributions (from 0 to many). The gatherer will call get_contributions() on each contributor and aggregate the results into a single collection.
The question is, what's the best signature for get_contributions()?
Option 1: std::vector<contribution> get_contributions() const
This is the most straightforward, but it leads to lots of copying as the gatherer copies each set of results into the master collection. And yes, performance matters here. For example, if the contributors were geometric models and getting contributions amounted to tesselating them into triangles for rendering, then speed would count and the number of contributions could be enormous.
Option 2: template <typename container> void get_contributions(container &target) const
This allows each contributor to add its contributions directly to the master container by calling target.push_back(foo). The drawback here is that we're exposing the container to other types of inspection and manipulation. I'd prefer to keep the interface as narrow as possible.
Option 3: template <typename out_it> void get_contributions(out_it &it) const
In this solution, the aggregator would pass a std::back_insert_iterator for the master collection, and the individual contributors would do *it++ = foo; for each contribution. This is the best I've come up with so far, but I'm left with the feeling that there must be a more elegant way. The back_insert_iterator feels like a kludge.
Is Option 3 the best, or is there a better approach? Does this gathering pattern have a name?
There's a fourth, which would require you to define your own iterator ranges. Check out Alexandrescu's presentation "Iterators Must Go".
Option 3 is the most idiomatic way. Note that you don't have to use back_insert_iterator. If you know how many elements are going to be added, you can resize the vector and then provide a regular vector iterator instead. It won't call push_back then (and may save you some copying).
back_insert_iterator's main advantage is that it expands the vector as needed.
It's not a kludge though. It's designed for this exact purpose.
One minor adjustment would be to pass the iterator by value, and then return it when the function returns.
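A minimal sketch of that adjustment, in the style of std::copy. The contribution and contributor types and the gather function are invented for illustration; the real contributors would of course produce something more interesting than ints.

```cpp
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical contribution type, for illustration only.
struct contribution
{
    int value;
};

class contributor
{
public:
    explicit contributor(std::vector<int> values) : values_(std::move(values)) {}

    // Takes the output iterator by value and returns the advanced copy.
    template <typename OutIt>
    OutIt get_contributions(OutIt out) const
    {
        for (int v : values_)
            *out++ = contribution{v};
        return out;
    }

private:
    std::vector<int> values_;
};

// Gatherer: threads one output iterator through all contributors.
std::vector<contribution> gather(const std::vector<contributor>& sources)
{
    std::vector<contribution> all;
    auto out = std::back_inserter(all);
    for (const contributor& c : sources)
        out = c.get_contributions(out);
    return all;
}
```

Returning the iterator matters when the caller passes a plain vector iterator into pre-sized storage: the returned value tells the gatherer where the next contributor should start writing.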
I would say there are two idiomatic STL ways: your Option 3 (taking an output iterator, which you'd pass by value, by the way) and taking a functor which you would call with each of the contributions.
Each of these is only appropriate if it is suitable to implement get_contributions as a template, of course.
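The functor variant could be sketched like this. Again, contribution, contributor and gather are hypothetical names used only to show the shape of the interface.

```cpp
#include <utility>
#include <vector>

// Hypothetical contribution type, for illustration only.
struct contribution
{
    int value;
};

class contributor
{
public:
    explicit contributor(std::vector<int> values) : values_(std::move(values)) {}

    // Calls sink(c) once per contribution; the contributor never sees
    // the container that the caller is filling.
    template <typename Sink>
    void get_contributions(Sink&& sink) const
    {
        for (int v : values_)
            sink(contribution{v});
    }

private:
    std::vector<int> values_;
};

// Gatherer: the lambda decides where each contribution ends up.
std::vector<contribution> gather(const std::vector<contributor>& sources)
{
    std::vector<contribution> all;
    auto sink = [&all](contribution c) { all.push_back(std::move(c)); };
    for (const contributor& c : sources)
        c.get_contributions(sink);
    return all;
}
```

Compared with an output iterator, the callable keeps the interface just as narrow but lets the gatherer filter, count or forward contributions without writing a custom iterator.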
Why does nobody seem to use tuples in C++, either the Boost Tuple Library or the standard library for TR1? I have read a lot of C++ code, and very rarely do I see the use of tuples, but I often see lots of places where tuples would solve many problems (usually returning multiple values from functions).
Tuples allow you to do all kinds of cool things like this:
tie(a,b) = make_tuple(b,a); //swap a and b
That is certainly better than this:
temp=a;
a=b;
b=temp;
Of course you could always do this:
swap(a,b);
But what if you want to rotate three values? You can do this with tuples:
tie(a,b,c) = make_tuple(b,c,a);
Tuples also make it much easier to return multiple values from a function, which is probably a much more common case than swapping values. Using references to return values is certainly not very elegant.
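As an illustration of the multiple-return case, here is a small sketch written against std::tuple from a C++11 or later standard library rather than Boost or TR1. The functions div_mod and recombine are made-up examples.

```cpp
#include <tuple>

// Hypothetical function returning a quotient and a remainder together.
std::tuple<int, int> div_mod(int a, int b)
{
    return std::make_tuple(a / b, a % b);
}

// Unpacks the returned tuple into two named locals with std::tie.
int recombine(int a, int b)
{
    int quotient = 0, remainder = 0;
    std::tie(quotient, remainder) = div_mod(a, b);
    return quotient * b + remainder;   // equals a whenever b != 0
}
```

The caller gets both values without out-parameters, but nothing in the type says which int is the quotient and which is the remainder.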
Are there any big drawbacks to tuples that I'm not thinking of? If not, why are they rarely used? Are they slower? Or is it just that people are not used to them? Is it a good idea to use tuples?
A cynical answer is that many people program in C++, but do not understand and/or use the higher level functionality. Sometimes it is because they are not allowed, but many simply do not try (or even understand).
As a non-boost example: how many folks use functionality found in <algorithm>?
In other words, many C++ programmers are simply C programmers using C++ compilers, and perhaps std::vector and std::list. That is one reason why the use of boost::tuple is not more common.
Because it's not yet standard. Anything non-standard has a much higher hurdle. Pieces of Boost have become popular because programmers were clamoring for them. (hash_map leaps to mind). But while tuple is handy, it's not such an overwhelming and clear win that people bother with it.
The C++ tuple syntax can be quite a bit more verbose than most people would like.
Consider:
typedef boost::tuple<MyClass1,MyClass2,MyClass3> MyTuple;
So if you want to make extensive use of tuples you either get tuple typedefs everywhere or you get annoyingly long type names everywhere. I like tuples. I use them when necessary. But it's usually limited to a couple of situations, like an N-element index or when using multimaps to tie the range iterator pairs. And it's usually in a very limited scope.
It's all very ugly and hacky looking when compared to something like Haskell or Python. When C++0x gets here and we get the 'auto' keyword, tuples will begin to look a lot more attractive.
The usefulness of tuples is inversely proportional to the number of keystrokes required to declare, pack, and unpack them.
For me, it's habit, hands down: Tuples don't solve any new problems for me, just a few I can already handle just fine. Swapping values still feels easier the old fashioned way -- and, more importantly, I don't really think about how to swap "better." It's good enough as-is.
Personally, I don't think tuples are a great solution to returning multiple values -- sounds like a job for structs.
But what if you want to rotate three values?
swap(a,b);
swap(b,c); // I knew those permutation theory lectures would come in handy.
OK, so with four or more values, eventually the n-tuple becomes less code than n-1 swaps. And with the default swap this does 6 assignments instead of the 4 you'd have if you implemented a three-cycle template yourself, although I'd hope the compiler would optimize that away for simple types.
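The hand-written three-cycle mentioned above might look like this. rotate3 is a hypothetical helper, written with moves so that it also stays cheap for large types under C++11 or later.

```cpp
#include <utility>

// Rotates three values left: a takes b's value, b takes c's, c takes a's.
// Costs one move construction plus three move assignments.
template <typename T>
void rotate3(T& a, T& b, T& c)
{
    T tmp = std::move(a);
    a = std::move(b);
    b = std::move(c);
    c = std::move(tmp);
}
```

It produces the same result as swap(a,b); swap(b,c); with four operations instead of six.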
You can come up with scenarios where swaps are unwieldy or inappropriate, for example:
tie(a,b,c) = make_tuple(b*c,a*c,a*b);
is a bit awkward to unpack.
Point is, though, there are known ways of dealing with the most common situations that tuples are good for, and hence no great urgency to take up tuples. If nothing else, I'm not confident that:
tie(a,b,c) = make_tuple(b,c,a);
doesn't do 6 copies, making it utterly unsuitable for some types (collections being the most obvious). Feel free to persuade me that tuples are a good idea for "large" types, by saying this ain't so :-)
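One way to check rather than guess is to count the operations with an instrumented type. The sketch below uses std::tuple under C++11 or later, not boost::tuple; counted and rotate_with_tie are hypothetical names. Under those assumptions I would expect three copy constructions (make_tuple copying b, c and a) followed by three move assignments into the tied references, but measure on your own toolchain before relying on that.

```cpp
#include <tuple>

// Instrumented type that counts how it is copied and moved.
struct counted
{
    static int copy_ctor, copy_assign, move_ctor, move_assign;
    int value = 0;

    counted() = default;
    explicit counted(int v) : value(v) {}
    counted(const counted& o) : value(o.value) { ++copy_ctor; }
    counted(counted&& o) noexcept : value(o.value) { ++move_ctor; }
    counted& operator=(const counted& o)
    {
        value = o.value;
        ++copy_assign;
        return *this;
    }
    counted& operator=(counted&& o) noexcept
    {
        value = o.value;
        ++move_assign;
        return *this;
    }
};

int counted::copy_ctor = 0;
int counted::copy_assign = 0;
int counted::move_ctor = 0;
int counted::move_assign = 0;

// The rotation under test; inspect the counters afterwards.
void rotate_with_tie(counted& a, counted& b, counted& c)
{
    std::tie(a, b, c) = std::make_tuple(b, c, a);
}
```

So for types with cheap moves (most standard collections), the tuple rotation costs three real copies rather than six; a hand-written three-cycle using moves avoids even those.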
For returning multiple values, tuples are perfect if the values are of incompatible types, but some folks don't like them if it's possible for the caller to get them in the wrong order. Some folks don't like multiple return values at all, and don't want to encourage their use by making them easier. Some folks just prefer named structures for in and out parameters, and probably couldn't be persuaded with a baseball bat to use tuples. No accounting for taste.
As many people pointed out, tuples are just not as useful as other features.
The swapping and rotating gimmicks are just gimmicks. They are utterly confusing to those who have not seen them before, and since that is pretty much everyone, these gimmicks are just poor software engineering practice.
Returning multiple values using tuples is much less self-documenting than the alternatives -- returning named types or using named references. Without this self-documentation, it is easy to confuse the order of the returned values if they are mutually convertible, and be none the wiser.
Not everyone can use boost, and TR1 isn't widely available yet.
When using C++ on embedded systems, pulling in Boost libraries gets complex. They couple to each other, so library size grows. You return data structures or use parameter passing instead of tuples. When returning tuples in Python, the data structure is defined by the order and types of the returned values; it's just not explicit.
You rarely see them because well-designed code usually doesn't need them: there are not too many cases in the wild where using an anonymous struct is superior to using a named one.
Since all a tuple really represents is an anonymous struct, most coders in most situations just go with the real thing.
Say we have a function "f" where a tuple return might make sense. As a general rule, such functions are usually complicated enough that they can fail.
If "f" CAN fail, you need a status return- after all, you don't want callers to have to inspect every parameter to detect failure. "f" probably fits into the pattern:
struct ReturnInts { int y, z; };
bool f(int x, ReturnInts& vals);
int x = 0;
ReturnInts vals;
if(!f(x, vals)) {
..report error..
..error handling/return...
}
That isn't pretty, but look at how ugly the alternative is. Note that I still need a status value, but the code is no more readable and not shorter. It is probably slower too, since I incur the cost of 1 copy with the tuple.
std::tuple<int, int, bool> f(int x);
int x = 0;
std::tuple<int, int, bool> result = f(x); // or "auto result = f(x)"
if(!std::get<2>(result)) {
... report error, error handling ...
}
Another significant downside is hidden in here: with "ReturnInts" I can alter "f"'s return by modifying "ReturnInts" WITHOUT ALTERING "f"'s INTERFACE. The tuple solution does not offer that critical feature, which makes it the inferior answer for any library code.
Certainly tuples can be useful, but as mentioned there's a bit of overhead and a hurdle or two you have to jump through before you can even really use them.
If your program consistently finds places where you need to return multiple values or swap several values, it might be worth it to go the tuple route, but otherwise sometimes it's just easier to do things the classic way.
Generally speaking, not everyone already has Boost installed, and I certainly wouldn't go through the hassle of downloading it and configuring my include directories to work with it just for its tuple facilities. I think you'll find that people already using Boost are more likely to find tuple uses in their programs than non-Boost users, and migrants from other languages (Python comes to mind) are more likely to simply be upset about the lack of tuples in C++ than to explore methods of adding tuple support.
As a data-store std::tuple has the worst characteristics of both a struct and an array; all access is nth position based but one cannot iterate through a tuple using a for loop.
So if the elements in the tuple are conceptually an array, I will use an array, and if the elements are not conceptually an array, a struct (which has named elements) is more maintainable (a.lastname is more explanatory than std::get<1>(a)).
This leaves the transformation mentioned by the OP as the only viable usecase for tuples.
I have a feeling that many use Boost.Any and Boost.Variant (with some engineering) instead of Boost.Tuple.