I understand that question sounds weird, so here is a bit of context.
Recently I was disappointed to learn that map-reduce in C++20 ranges does not work the way one would expect, i.e.
const double val = data | transform(...) | accumulate(...);
does not work; you must write it this unnatural way:
const double val = accumulate(data | transform(...));
Details can be found here and here, but it boils down to the fact that accumulate cannot disambiguate between two different use cases.
So this got me thinking:
If C++20 required that you must use the pipe for using ranges, i.e. you could not write
vector<int> v;
sort(v);
but you must write
vector<int> v;
v|sort();
would that solve the problem of ambiguity?
And if so, although unnatural to people used to std::sort and the other STL algorithms, I wonder whether in the long run that would have been the better design choice.
Note:
If this question is too vague feel free to vote to close, but I feel that this is a legitimate design question that can be answered in a relatively unbiased way, especially if my understanding of the problem is wrong.
You need to differentiate between range algorithms and range adaptors. Algorithms are functions that perform a generic operation on a range of values. Adaptors are functions which create range views that modify the presentation of a range. Adaptors are chained by the | operator; algorithms are just regular functions.
Sometimes, the same conceptual thing can have an algorithm and adaptor form. transform exists as both an algorithm and an adaptor. The former stores the transformation into an output range; the latter creates a view range of the input that lazily computes the transformation as requested.
These are different tasks for different needs and uses.
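For illustration, here is a minimal sketch of the two transform forms in C++20 (the lambda and containers are just examples):
#include <algorithm>
#include <ranges>
#include <vector>

int main() {
    std::vector<int> in{1, 2, 3};

    // Algorithm form: eagerly writes the results into an output range.
    std::vector<int> out(in.size());
    std::ranges::transform(in, out.begin(), [](int x) { return x * x; });

    // Adaptor form: builds a lazy view; values are computed as they are iterated.
    auto squares = in | std::views::transform([](int x) { return x * x; });
    for (int s : squares) { (void)s; }
}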
Also, note that there is no sort adaptor in C++20. A sort adaptor would have to create a view range that somehow mixed around the elements in the source range. It would have to allocate storage for the new sequence of values (even if it's just sorting iterators/pointers/indices to the values). And the sorting would have to be done at construction time, so there would be no lazy operation taking place.
This is also why accumulate doesn't work that way. It's not a matter of "ambiguity"; it's a matter of the fundamental nature of the operation. Accumulation computes a value from a range; it does not compute a new range from an existing one. That's the work of an algorithm, not an adaptor.
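Concretely, the map-reduce from the question is spelled as an ordinary function call over the view. A minimal sketch (this relies on a transformed view of a vector being a common range, so std::accumulate's iterator-pair interface applies):
#include <numeric>
#include <ranges>
#include <vector>

int main() {
    std::vector<double> data{1.0, 2.0, 3.0};
    auto squared = data | std::views::transform([](double x) { return x * x; });

    // accumulate is an algorithm, not an adaptor, so it is called directly on the view:
    const double val = std::accumulate(squared.begin(), squared.end(), 0.0);
    (void)val;
}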
Some tasks are useful in algorithm form. Some tasks are useful in adaptor form (you find very few zip-like algorithms). Some tasks are useful in both. But because these are two separate concepts for different purposes, they have different ways of invoking them.
would that solve the problem of ambiguity?
Yes.
If there's only one way to write something, that one way must be the only possible interpretation. If an algorithm "call" can only ever be a partial call that must be completed with a | operation with a range on the left-hand side, then the question of whether the algorithm call is partial or total never even arises. It's just always partial.
No ambiguity in that sense.
But if you went that route, you would end up with things like:
auto sum = accumulate("hello"s);
This doesn't actually sum the chars in that string; it is actually a placeholder that is waiting for a range to accumulate over, with "hello"s as the initial value.
I'm trying to create a general memoizator for multiple and arbitrary functions.
For each function std::function<ReturnType(Args...)> that we want to memoize, we keep an unordered_map<Args..., ReturnType> (I'm keeping things simple on purpose).
The big problem comes when our memoized function has some really big arguments Args...: for example, suppose that our function sorts a vector of 10 million numbers and then returns the sorted vector, so something like std::function<vector<double>(vector<double>)>.
As you can imagine, after having inserted fewer than 100 vectors, we have already filled 8 GB of memory. Notice that maybe this comes from the combination of huge vectors and the memory required by the sorting algorithm (I didn't investigate the causes).
So what if, instead of the structure described above, we define an unordered_map<UUID(Args...), ReturnType> (where UUID = Universally Unique Identifier)? We would have to relax the deterministic guarantee (so maybe we return a wrong result), but with a very low probability.
The problem is that, since I have never used UUIDs, I don't know whether there are implementations suitable for this application.
So my question is:
Does there exist a better solution than UUIDs for this problem?
Which UUID implementation is best suited for this problem?
Is boost uuid a possible candidate?
Unfortunately, this could solve the problem for Args... but not for ReturnType, so is there a solution for the memoized result?
Notice that the UUID generated for an object x should be the same even across different runs and machines.
Notice that if we get the same UUID for two different objects (and so return the wrong value) with a really low probability, then it could be acceptable... let's say that this would be a "probabilistic memoizator".
I know that this application doesn't make sense in a memoization context (what are the odds that a user asks twice to sort the same 10-million-element vector?), but it's time and memory expensive (so good for benchmarking and for introducing the memory problem that I stated above), so please don't whip and crucify me because this is an absurd memoization application.
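To make the idea concrete, here is a hedged sketch of the hash-keyed (rather than UUID-keyed) cache I have in mind (boost::hash_range and the name memoized_sort are illustrative choices, and the results are still stored in full, so this only addresses the Args... side):
#include <algorithm>
#include <cstddef>
#include <unordered_map>
#include <vector>
#include <boost/functional/hash.hpp>

// Hypothetical "probabilistic memoizer": key the cache by a hash of the
// argument's contents instead of storing the whole argument vector.
std::unordered_map<std::size_t, std::vector<double>> cache;

std::vector<double> memoized_sort(const std::vector<double>& v) {
    const std::size_t key = boost::hash_range(v.begin(), v.end());
    auto it = cache.find(key);
    if (it != cache.end())
        return it->second;                 // rare wrong answer if two inputs collide
    std::vector<double> sorted = v;
    std::sort(sorted.begin(), sorted.end());
    return cache.emplace(key, std::move(sorted)).first->second;
}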
Identifying any object is easy. The address is "object identity" in C++. This is also the reason that even empty classes cannot have zero size.
Now, what you want is value equivalence. That's strictly not in the language domain. It's solidly in the application/library logic domain.
You should consider using something like boost::flyweights. It has precisely this facility, and makes it "easy" to customize the equivalence semantics for your types.
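As a hedged sketch of the basic facility (default equality semantics; the customization hooks are described in the Boost.Flyweight documentation):
#include <boost/flyweight.hpp>
#include <string>

int main() {
    // Equal values end up sharing a single stored representative; copies are cheap handles.
    boost::flyweight<std::string> a("one big repeated value");
    boost::flyweight<std::string> b("one big repeated value");
    bool same = (a == b);   // true: both refer to the shared stored value
    (void)same;
}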
I believe I'd like to use boost::icl::interval_map to solve a problem (described here; I'll post a complete answer if interval_maps ultimately work).
I want to use an interval_map<unsigned long long, set<foo*>>, but the documentation for boost::icl mentions that there are potential efficiency problems (quoted below).
We are introducing interval_maps using an interval map of sets of strings, because of its didactic advantages. The party example is used to give an immediate access to the basic ideas of interval maps and aggregate on overlap. For real world applications, an interval_map of sets is not necessarily recommended. It has the same efficiency problems as a std::map of std::sets. There is a big realm though of using interval_maps with numerical and other efficient data types for the associated values.
What are the efficiency issues with std::map of std::sets?
And how can I avoid them?
Both std::map<K, V> and std::set<V> are node-based containers linked by pointers. Traversing them has nice complexity guarantees (i.e., each individual operation is at most O(log n)), but you actually need fairly sizable containers for the complexity to matter compared, e.g., to a std::vector<std::pair<K, V>>, especially when K and V are fundamental types. The main performance issue with node-based containers is that they are laid out more or less randomly in memory, while contemporary CPUs like to access data that is clustered in some form.
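As a hedged sketch of the flat alternative mentioned above (a sorted std::vector<std::pair<K, V>> searched with std::lower_bound; the names are illustrative):
#include <algorithm>
#include <utility>
#include <vector>

using Entry = std::pair<int, int>;   // K = int, V = int, purely for illustration

// 'flat' is assumed to be kept sorted by key; contiguous storage is cache-friendly.
const int* find_value(const std::vector<Entry>& flat, int key) {
    auto it = std::lower_bound(flat.begin(), flat.end(), key,
                               [](const Entry& e, int k) { return e.first < k; });
    return (it != flat.end() && it->first == key) ? &it->second : nullptr;
}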
Of course, as usual you'll need to measure the times obtained between different implementations on fairly realistic data sets to determine which data structure yields the best performance.
A generally asked question is whether we should use unordered_map or map for faster access.
The most common (rather age-old) answer to this question is:
If you want direct access to single elements, use unordered_map, but if you want to iterate over elements (most likely in a sorted way), use map.
Shouldn't we consider the data type of the key while making such a choice?
As the hash algorithm for one data type (say int) may be more collision-prone than for another (say string).
If that is the case (the hash algorithm is quite collision-prone), then I would probably use map even for direct access, as in that case the O(1) constant time (probably averaged over a large number of inputs) for unordered_map may be more than lg(N) even for fairly large values of N.
You raise a good point... but you are focusing on the wrong part.
The problem is not the type of the key, per se, but the hash function that is used to derive a hash value for that key.
Lexicographical ordering is easy: if you tell me you want to order a structure according to its 3 fields (and they already support ordering themselves), then I'll just write:
bool operator<(Struct const& left, Struct const& right) {
return boost::tie(left._1, left._2, left._3)
< boost::tie(right._1, right._2, right._3);
}
And I am done!
However, writing a hash function is difficult. You need some knowledge about the distribution of your data (statistics), you might need to prevent specially crafted attacks, etc... Honestly, I do not expect many people to be able to craft a good hash function. But the worst part is, composition is difficult too! Given two independent fields, combining their hash values right is hard (hint: boost::hash_combine).
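For the same three-field Struct, composition along the lines of that hint might look roughly like this (assuming Boost's hash headers; hash_value is the customization point boost::hash picks up via argument-dependent lookup):
#include <boost/functional/hash.hpp>
#include <cstddef>

struct Struct { int _1; int _2; int _3; };

// boost::hash<Struct> will find this overload via ADL.
std::size_t hash_value(Struct const& s) {
    std::size_t seed = 0;
    boost::hash_combine(seed, s._1);   // mix each field into the running seed
    boost::hash_combine(seed, s._2);
    boost::hash_combine(seed, s._3);
    return seed;
}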
So, indeed, if you have no idea what you are doing and you are handling user-crafted data, just stick to a map. It's maybe slower (not sure), but it's safer.
There isn't really such a thing as a collision-prone object, because collision-proneness depends on the hash function you use. Assuming the objects are not identical, there is some feature that can be utilized to create an informative hash function.
Assuming you have some knowledge of your data, and you know it is likely to have a lot of collisions for some hash function h1(), then you should find and use a different hash function h2() which is better suited for the task.
That said, there are other reasons to favor tree-based data structures over hash-based ones (such as latency and the size of the set); some are covered by my answer in this thread.
There's no point trying to be too clever about this. As always: profile, compare, optimise if useful. There are many factors involved, quite a few of which aren't specified in the Standard and will vary across compilers. Some things may profile better or worse on specific hardware. If you are interested in this stuff (or paid to pretend to be), you should learn about these things a bit more systematically. You might start by learning a bit about actual hash functions and their characteristics. It's extremely rare to be unable to find a hash function that has, for all practical purposes, no more collision-proneness than a random but repeatable value; it's just that sometimes it's slower to approach that point than it is to handle a few extra collisions.
In one of the applications I work on, it is necessary to have a function like this:
bool IsInList(int iTest)
{
// Return whether iTest appears in a set of numbers.
}
The number list is known at app load-up (but is not always the same between two instances of the same application) and will not change (or be added to) throughout the whole run of the program. The integers themselves may be large and have a large range, so it is not efficient to have a vector<bool>. Performance is an issue as the function sits in a hot spot. I have heard about perfect hashing but could not find any good advice. Any pointers would be helpful. Thanks.
p.s. I'd ideally like it if the solution isn't a third-party library, because I can't use them here. Something simple enough to be understood and manually implemented would be great if it were possible.
I would suggest using Bloom Filters in conjunction with a simple std::map.
Unfortunately the Bloom Filter is not part of the standard library, so you'll have to implement it yourself. However, it turns out to be quite a simple structure!
A Bloom Filter is a data structure specialized in answering the question "Is this element part of the set?", and it does so with an incredibly tight memory requirement, and quite fast too.
The slight catch is that the answer is... special. To "Is this element part of the set?" it replies either:
No
Maybe (with a given probability depending on the properties of the Bloom Filter)
This looks strange until you look at the implementation, and it may require some tuning (there are several properties) to lower the probability but...
What is really interesting for you is that, in all the cases where it answers No, you have the guarantee that the element isn't part of the set.
As such, a Bloom Filter is ideal as a doorman for a Binary Tree or a Hash Map. Carefully tuned, it will let only very few false positives pass. For example, gcc uses one.
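A minimal sketch of the structure (the size, the two hash functions and the class name are illustrative, not tuned):
#include <bitset>
#include <cstddef>
#include <functional>

class BloomFilter {
    static constexpr std::size_t N = 1u << 20;   // number of bits in the filter
    std::bitset<N> bits_;

    // Two cheap, roughly independent hash positions; for illustration only.
    static std::size_t h1(unsigned x) { return std::hash<unsigned>()(x) % N; }
    static std::size_t h2(unsigned x) { return std::hash<unsigned>()(x * 2654435761u) % N; }

public:
    void insert(unsigned x) { bits_.set(h1(x)); bits_.set(h2(x)); }

    // false => definitely not in the set; true => maybe (confirm in the real container)
    bool maybe_contains(unsigned x) const { return bits_.test(h1(x)) && bits_.test(h2(x)); }
};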
What comes to my mind is gperf. However, it is based on strings and not on numbers. Still, part of the calculation can be tweaked to use numbers as input for the hash generator.
integers, strings, doesn't matter
http://videolectures.net/mit6046jf05_leiserson_lec08/
After the intro, at 49:38, you'll learn how to do this. The Dot Product hash function is demonstrated since it has an elegant proof. Most hash functions are like voodoo black magic. Don't waste time here; find something that is FAST for your datatype and that offers some adjustable SEED for hashing. A good combo there is better than the alternative of growing the hash table.
At 54:30 the professor draws a picture of a standard way of doing a perfect hash. The perfect minimal hash is beyond this lecture. (Good luck!)
It really all depends on what you mod by.
Keep in mind, the analysis he shows can be further optimized by knowing the hardware you are running on.
With std::map you get very good performance in 99.9% of scenarios. If your hot spot hits the same iTest value(s) multiple times, combine the map lookup with a temporary hash cache.
Int is one of the datatypes where it is possible to just do:
bool hash[UINT_MAX]; // stackoverflow ;)
And fill it up. If you don't care about negative numbers, then it's twice as easy.
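A heap-allocated sketch of that idea (still roughly 512 MB for the full unsigned range even with std::vector<bool>'s bit packing, and it assumes a 64-bit size_t, so it is only viable if that much memory is acceptable):
#include <climits>
#include <cstddef>
#include <vector>

// One bit per possible unsigned value, on the heap rather than the stack.
std::vector<bool> present(static_cast<std::size_t>(UINT_MAX) + 1);

bool IsInList(int iTest)
{
    return present[static_cast<unsigned>(iTest)];
}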
A perfect hash function maps a set of inputs onto the integers with no collisions. Given that your input is a set of integers, the values themselves are a perfect hash function. That really has nothing to do with the problem at hand.
The most obvious and easy to implement solution for testing existence would be a sorted list or balanced binary tree. Then you could decide existence in log(N) time. I doubt it'll get much better than that.
For this problem I would use a binary search, assuming it's possible to keep the list of numbers sorted.
Wikipedia has example implementations that should be simple enough to translate to C++.
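In C++ terms, a sketch of that approach against the question's signature (std::binary_search over a list sorted once at startup; g_numbers is an assumed name):
#include <algorithm>
#include <vector>

std::vector<int> g_numbers;   // filled and sorted once at application load

bool IsInList(int iTest)
{
    // O(log N) membership test on the sorted list.
    return std::binary_search(g_numbers.begin(), g_numbers.end(), iTest);
}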
It's not necessary or practical to aim for mapping N distinct, randomly dispersed integers to N contiguous buckets (i.e. a perfect minimal hash); the important thing is to identify an acceptable ratio. To do this at run time, you can start by configuring a worst-acceptable ratio (say 1 to 20) and a no-point-being-better-than-this ratio (say 1 to 4), then randomly vary (e.g. by changing the prime numbers used) a fast-to-calculate hash algorithm to see how easily you can meet increasingly difficult ratios. For the worst-acceptable ratio you don't time out; instead you fall back on something slower but reliable (a container, or displacement lists to resolve collisions). Then allow a second or ten (configurable) for each X% better until you can't succeed at that ratio or you reach the no-point-being-better ratio.
Just so everyone's clear, this works for inputs only known at run time with no useful patterns known beforehand, which is why different hash functions have to be trialed or actively derived at run time. It is not acceptable to simply say "integer inputs form a hash", because there are collisions when they're %-ed into any sane array size. But you don't need to aim for a perfectly packed array either. Remember too that you can have a sparse array of pointers to a packed array, so there's little memory wasted for large objects.
After working with it for a while, I came up with a number of hash functions that seemed to work reasonably well on strings, resulting in a unique (i.e. perfect) hash.
Let's say the values ranged from L to H in the array. This yields a Range R = H - L + 1.
Generally it was pretty big.
I then applied the modulus operator from H down to L + 1, looking for a mapping that keeps them unique, but has a smaller range.
In your case you are using integers. Technically, they are already hashed, but the range is large.
It may be that you can get what you want, simply by applying the modulus operator.
It may be that you need to put a hash function in front of it first.
It also may be that you can't find a perfect hash for it, in which case your container class should have a fallback position... binary search, or a map, or something like that, so that you can guarantee the container will work in all cases.
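One way to sketch that search (an illustration under assumptions: the values are distinct, and it looks for a table size m under which all values stay distinct, searching upward from the number of values rather than downward from the value range as described above):
#include <cstddef>
#include <unordered_set>
#include <vector>

// Finds a modulus m such that v % m is collision-free over 'values'.
// Assumes the values are distinct; otherwise no such m exists and this loops forever.
std::size_t find_modulus(const std::vector<int>& values) {
    if (values.empty()) return 1;
    for (std::size_t m = values.size(); ; ++m) {
        std::unordered_set<std::size_t> seen;
        bool unique = true;
        for (int v : values) {
            if (!seen.insert(static_cast<unsigned>(v) % m).second) {
                unique = false;
                break;
            }
        }
        if (unique) return m;   // every value lands in a distinct bucket
    }
}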
A trie or perhaps a van Emde Boas tree might be a better bet for creating a space-efficient set of integers whose lookup time is constant with respect to the number of objects in the data structure, assuming that even std::bitset would be too large.
I was browsing the SGI STL documentation and ran into project1st<Arg1, Arg2>. I understand its definition, but I am having a hard time imagining a practical usage.
Have you ever used project1st or can you imagine a scenario?
A variant of project1st (taking a std::pair, and returning .first) is quite useful. You can use it in combination with std::transform to copy the keys from a std::map<K,V> to a std::vector<K>. Similarly, a variant of project2nd can be used to copy the values from a map to a vector<V>.
As it happens, none of the standard algorithms really benefits from project1st. The closest is partial_sum(project1st), which would set all output elements to the first input element. It mainly exists because the STL is heavily founded in mathematical set theory, and there operations like project1st are basic building blocks.
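A hedged sketch of that keys-to-vector use, with a hand-written select-first functor standing in for the SGI select1st/project1st idea:
#include <algorithm>
#include <iterator>
#include <map>
#include <string>
#include <utility>
#include <vector>

// A stand-in for select1st: takes a pair, returns its .first.
struct SelectFirst {
    template <class K, class V>
    K operator()(const std::pair<const K, V>& p) const { return p.first; }
};

int main() {
    std::map<std::string, int> m;
    m["a"] = 1;
    m["b"] = 2;

    std::vector<std::string> keys;
    std::transform(m.begin(), m.end(), std::back_inserter(keys), SelectFirst());
    // keys now holds "a" and "b", in the map's sorted order
}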
My guess is that if you were using the strategy pattern and had a situation where you needed to pass an identity object, this would be a good choice. For example, there might be a case where an algorithm takes several such objects, and perhaps it is possible that you want one of them to do nothing under some situation.
Parallel programming. Imagine a situation where two processes come up with two valid but different results for a given computation, and you need to force them to be the same. project1st/2nd provides a very convenient way to perform this operation on a whole container, using an appropriate parallel call that takes a functor as an argument.
I assume that someone had a practical use for it, or it wouldn't have been written, but I'm drawing a blank on what it might have been. Presumably its use-case is similar to the identity function that the description mentions, where there's no real need for processing but the syntax requires a functor anyway.
The example on that same page suggests using it with the two-container form of std::transform, but if I'm not mistaken, the way they're using it is functionally identical to std::copy, so I don't see the point.
It looks like a solution in search of a problem to me.