I was using a lambda function as the comparator in the sort() function. In my lambda I returned true when the two elements were equal, and then I got a segmentation fault.
After reviewing the C++ Compare requirements, it says:
For all a, comp(a,a) == false
I don't understand why it must be false. Why can't I let comp(a,a)==true?
(Thanks in advance)
Think of Comp as some sort of "is smaller than" relationship; that is, it defines some kind of ordering on a set of data.
Now you probably want to do some stuff with this relationship, like sorting data in increasing order, binary search in sorted data, etc.
There are many algorithms that do stuff like this very fast, but they usually have the requirement that the ordering they deal with is "reasonable", which was formalized with the term Strict weak ordering. It is defined by the rules in the link you gave, and the first one basically means:
"No element shall be smaller than itself."
This is indeed reasonable to assume, and one of the things our algorithms require.
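For illustration (my example, not from the question), the difference usually comes down to using < rather than <= in the comparator:

#include <algorithm>
#include <vector>

int main() {
    std::vector<int> v{3, 1, 2, 2, 5, 4};

    // Correct: a strict "is smaller than", so comp(a, a) is false.
    std::sort(v.begin(), v.end(), [](int a, int b) { return a < b; });

    // Wrong: returns true for equal elements, so comp(a, a) is true.
    // This violates strict weak ordering and is undefined behavior; with
    // enough equal elements it can read out of bounds or loop forever.
    // std::sort(v.begin(), v.end(), [](int a, int b) { return a <= b; });
}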
The standard library defines weak ordering, partial ordering and strong ordering. These types define semantics for orderings covering 3 of the 4 combinations of implying/not implying substitutability (a == b implies f(a) == f(b), where f() reads comparison-related state) and allowing/disallowing incomparable values (a < b, a == b and a > b may all be false).
I have a situation where my three-way (spaceship) operator has the semantics of equality implies substitutability, but some values may be incomparable. I could return a weak_ordering from my three-way comparison, but this lacks the semantic meaning of my type. I could also define my own ordering type, but I am reluctant to do that without understanding why it was omitted.
I believe the standard library's orderings are equivalent to the mathematical definitions of weak ordering, partial ordering and total ordering. However, those are defined without the notion of substitutability. Is there a mathematical ordering equivalent to the one I am describing? Is there a reason why it was omitted from the standard?
It's not entirely clear from the original paper why the possibility of a "strong partial order" wasn't included. It already had comparison categories that were themselves incomparable (e.g., weak_ordering and strong_equality), so it could presumably have included another such pair.
One possibility is that the mathematical notion of a partial order actually requires that no two (distinct) elements are equal in the sense that a ≤ b and b ≤ a (so that the graph induced by ≤ is acyclic). People therefore don't usually talk about what "equivalent" means in a partial order. Relaxing that restriction produces a preorder, or equivalently one can talk about a preorder as a partial order on a set of equivalence classes projected onto their elements. That's truly what std::partial_ordering means if we consider there to exist multiple values that compare equivalent.
Of course, the easy morphism between preorders and partial orders limits the significance of the restriction on the latter; "distinct" can always be taken to mean "not in the same equivalence class" so long as you equally apply the equivalence-class projection to any coordinate questions of (say) cardinality. In the same fashion, the standard's notion of substitutability is not very useful: it makes very little sense to say that a std::weak_ordering operator<=> produces equivalent for two values that differ "where f() reads comparison-related state" since if the state were comparison-related operator<=> would distinguish them.
The practical answer is therefore to largely ignore std::weak_ordering and consider your choices to be total/partial ordering—in each case defined on the result of some implicit projection onto equivalence classes. (The standard talks about a "strict weak order" in the same fashion, supposing that there is some "true identity" that equivalent objects might lack but which is irrelevant anyway. Another term for such a relation is a "total preorder", so you could also consider the two meaningful comparison categories to simply be preorders that might or might not be total. That makes for a pleasingly concise categorization—"<=> defines preorders; is yours total?"—but would probably lose many readers.)
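For concreteness (an illustration of mine, not part of the answer above): the built-in types already show the two categories worth keeping, with floating point supplying the incomparable values.

#include <cassert>
#include <compare>
#include <limits>

int main() {
    const double nan = std::numeric_limits<double>::quiet_NaN();

    // Floating point is the canonical std::partial_ordering:
    // NaN is incomparable with everything, including itself.
    assert((1.0 <=> 2.0) == std::partial_ordering::less);
    assert((1.0 <=> nan) == std::partial_ordering::unordered);

    // Integers have no incomparable values, and equivalence implies
    // substitutability, so they get std::strong_ordering.
    assert((1 <=> 2) == std::strong_ordering::less);
}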
I need to sort a file list by date. There's this answer on how to do it. It worries me, though: it operates on a live filesystem that can change during the operation.
The comparison function uses:
#include <string>
#include <sys/stat.h>

struct FileNameModificationDateComparator {
    // Returns true if and only if lhs < rhs
    bool operator()(const std::string& lhs, const std::string& rhs) const {
        struct stat attribLhs;
        struct stat attribRhs;          // File attribute structs
        stat(lhs.c_str(), &attribLhs);
        stat(rhs.c_str(), &attribRhs);  // Get file stats
        return attribLhs.st_mtime < attribRhs.st_mtime; // Compare last modification dates
    }
};
From what I understand, this function can, and will, be called multiple times with the same file, comparing it against different files. The file can be modified by external processes while the sort is running; one of the older files can become the newest between two comparisons, turning up older than a fairly old file at one point and later newer than one of the newest files...
What will std::sort() do? I'm fine with an occasional ordering error in the result. I'm not fine with a crash or a freeze (infinite loop) or other such unpleasantries. Am I safe?
Am I safe?
No.
std::sort requires a comparison that imposes a strict weak ordering, and A<B, B<C, C<A violates that.
This violation incurs undefined behavior, and in practice, results in some of the worst kinds of undefined behavior.
It should also be noted that writing a sort algorithm that works on elements whose ordering arbitrarily changes during the sort would be near-impossible: at no point could the algorithm know that the entire collection is currently sorted.
As other answers have already said, handing std::sort a comparator that doesn't satisfy the strict weak ordering requirement, consistently across repeated calls with the same values, causes undefined behavior.
That doesn't only mean that the range may end up incorrectly sorted; it can cause more serious problems, not only in theory but also in practice. A common one is, as you already said, an infinite loop in the algorithm, but it can also lead to crashes or vulnerabilities.
For example (I haven't checked whether other implementations behave similarly), I looked at libstdc++'s std::sort implementation, which, as part of introsort, uses insertion sort. The insertion sort calls a function __unguarded_linear_insert, see the github mirror. This function performs a linear search on a range via the comparator without guarding for the end of the range, because the caller is supposed to have already verified that the searched item falls into the range. If the result of the comparison changes between the guard comparison in the caller and the unguarded linear search, the iterator is incremented out-of-bounds, which could produce a heap overrun, a null dereference, or anything else depending on the iterator type.
For a demonstration, see https://godbolt.org/z/8qajYEad7.
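A minimal sketch of the hazard (my own example, not the one behind the godbolt link): a comparator whose answer for the same pair can change between calls, much like the stat()-based one, violates the requirement and may crash, overrun a buffer, or loop forever.

#include <algorithm>
#include <cstdlib>
#include <vector>

int main() {
    std::vector<int> v(1000);
    for (int i = 0; i < 1000; ++i) v[i] = i;

    // Intentionally broken: the answer for the same pair can change between
    // calls, just as stat() results can change while the sort is running.
    // This violates strict weak ordering; the behavior is undefined, and in
    // practice it may crash, read out of bounds, or never terminate.
    std::sort(v.begin(), v.end(), [](int, int) { return std::rand() % 2 == 0; });
}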
std::sort() assumes that the collection is sortable.
Order theory defines a collection as sortable if its ordering relation is a partial order, that is:
it's reflexive, that is, a <= a is true
antisymmetric, that is: (a <= b and b <= a) <=> a = b
transitive, that is: (a <= b and b <= c) => a <= c
See the definition of partial ordering on page 7 of https://web.stanford.edu/class/archive/cs/cs103/cs103.1126/handouts/060%20Relations.pdf
In practice, reflexivity is not what std::sort expects (the comparator it takes is a strict "less than", so a < a must be false), but if equal elements are reported as ordered, a sorting algorithm may swap them needlessly or worse, so it's strongly advisable never to report an element as coming before itself or before an equal element.
Your problem statement says that the relation over your collection is not transitive. But mind you, it is perfectly transitive at any single moment; the problem is that during the (short) duration of your sorting algorithm, elements may change their values.
This is not well-defined behavior, and in C++ it is undefined behavior.
So, the way I would approach your problem is to bank on the fact that it's transitive at any given moment. Also, why would you stat the files each time you compare them? A stat is an I/O operation and slows down your process. It makes much more sense to stat the files only once, before you sort them, and store the results in a collection whose items may change their order but whose values will not change (file1's modification time is read before the algorithm starts and, from there on until the end of the sort, stays unchanged in your set, even if it's no longer accurate).
The risk involved with this approach is that the result may be out of date by the few milliseconds that passed since the measurements, a problem you already specified as being acceptable.
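A minimal sketch of that snapshot-then-sort approach (the function name and error handling are mine):

#include <algorithm>
#include <ctime>
#include <string>
#include <sys/stat.h>
#include <utility>
#include <vector>

// Stat each file exactly once, then sort the snapshot. The comparison now works
// on values that cannot change mid-sort, so strict weak ordering is preserved.
std::vector<std::string> sortByModificationDate(const std::vector<std::string>& files) {
    std::vector<std::pair<time_t, std::string>> snapshot;
    snapshot.reserve(files.size());
    for (const auto& name : files) {
        struct stat attrib {};
        if (stat(name.c_str(), &attrib) == 0)
            snapshot.emplace_back(attrib.st_mtime, name);  // capture mtime once
    }
    std::sort(snapshot.begin(), snapshot.end());           // std::pair sorts by mtime first

    std::vector<std::string> result;
    result.reserve(snapshot.size());
    for (auto& entry : snapshot)
        result.push_back(std::move(entry.second));
    return result;
}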
Furthermore, if you need this sorting often, it might make sense to sort periodically (maybe once every 10 minutes, or whatever interval you choose), cache the result, and whenever you need the sorted list, just consult the cache.
Herb Sutter, in his proposal for the "spaceship" operator (section 2.2.2, bottom of page 12), says:
Basing everything on <=> and its return type: This model has major advantages, some unique to this proposal compared to previous proposals for C++ and the capabilities of other languages:
[...]
(6) Efficiency, including finally achieving zero-overhead abstraction for comparisons: The vast majority of comparisons are always single-pass. The only exception is generated <= and >= in the case of types that support both partial ordering and equality. For <, single-pass is essential to achieve the zero-overhead principle to avoid repeating equality comparisons, such as for struct Employee { string name; /*more members*/ }; used in struct Outer { Employee e; /*more members*/ }; – today's comparisons violate zero-overhead abstraction because operator< on Outer performs redundant equality comparisons: it performs if (e != that.e) return e < that.e; which traverses the equal prefix of e.name twice (and if the name is equal, traverses the equal prefixes of other members of Employee twice as well), and this cannot be optimized away in general. As Kamiński notes, zero-overhead abstraction is a pillar of C++, and achieving it for comparisons for the first time is a significant advantage of this design based on <=>.
But then he gives this example (section 1.4.5, page 6):
class PersonInFamilyTree { // ...
public:
    std::partial_ordering operator<=>(const PersonInFamilyTree& that) const {
        if (this->is_the_same_person_as(that))   return std::partial_ordering::equivalent;
        if (this->is_transitive_child_of(that))  return std::partial_ordering::less;
        if (that.is_transitive_child_of(*this))  return std::partial_ordering::greater;
        return std::partial_ordering::unordered;
    }
    // ... other functions, but no other comparisons ...
};
Would defining operator>(a,b) as a<=>b > 0 not lead to large overhead (though in a different form than he discusses)? That code would first test for equality, then for less, and only then for greater, rather than directly testing only for greater.
Am I missing something here?
Would defining operator>(a,b) as a<=>b > 0 not lead to large overhead?
It would lead to some overhead. The magnitude of the overhead is relative, though - in situations when costs of running comparisons are negligible in relation to the rest of the program, reducing code duplication by implementing one operator instead of five may be an acceptable trade-off.
However, the proposal does not suggest removing the other comparison operators in favor of <=>: if you want to overload the other comparison operators, you are free to do so:
Be general: Don’t restrict what is inherent. Don’t arbitrarily restrict a complete set of uses. Avoid special cases and partial features. – For example, this paper supports all seven comparison operators and operations, including adding three-way comparison via <=>. It also supports all five major comparison categories, including partial orders.
For some definition of large. There is overhead because, in a partial ordering, a == b iff a <= b and b <= a. The complexity would be the same as a topological sort, O(V+E). Of course, the modern C++ approach is to write safe, clean and readable code first and then optimize. You can choose to implement the spaceship operator first, then specialize once you have identified a performance bottleneck.
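For instance, a sketch of that "generalize first, specialize later" idea, using the question's class (the helper declarations are assumed): keep operator<=> for completeness and overload just the operator you have measured to be hot, so it tests only the relation it needs.

#include <compare>

class PersonInFamilyTree {
public:
    // General-purpose three-way comparison, as in the question.
    std::partial_ordering operator<=>(const PersonInFamilyTree& that) const {
        if (is_the_same_person_as(that))        return std::partial_ordering::equivalent;
        if (is_transitive_child_of(that))       return std::partial_ordering::less;
        if (that.is_transitive_child_of(*this)) return std::partial_ordering::greater;
        return std::partial_ordering::unordered;
    }

    // Hand-written operator> for a measured hot path: overload resolution
    // prefers it over the rewritten (a <=> b) > 0 candidate, so only the one
    // relation that matters gets tested.
    bool operator>(const PersonInFamilyTree& that) const {
        return that.is_transitive_child_of(*this);
    }

private:
    bool is_the_same_person_as(const PersonInFamilyTree&) const;  // assumed helpers,
    bool is_transitive_child_of(const PersonInFamilyTree&) const; // not defined here
};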
Generally speaking, overloading <=> makes sense when you're dealing with a type where doing all comparisons at once is either only trivially more expensive or has the same cost as comparing them differently.
With strings, <=> seems more expensive than a straight == test, since you have to subtract each pair of characters. However, since you already had to load each pair of characters into memory, adding a subtraction on top of that is a trivial expense. Indeed, comparing two numbers for equality is sometimes implemented by compilers as a subtraction and a test against zero. And even for compilers that don't do that, the subtract-and-test-against-zero is probably not significantly less efficient.
So for basic types like that, you're more or less fine.
When you're dealing with something like tree ordering, you really need to know up-front which operation you care about. If all you asked for was ==, you really don't want to have to search the rest of the tree just to know that they're unequal.
But personally... I would never implement something like tree ordering with comparison operators to begin with. Why? Because I think that comparisons like that ought to be logically fast operations. Whereas a tree search is such a slow operation that you really don't want to do it by accident or at any time other than when it is absolutely necessary.
Just look at this case. What does it really mean to say that a person in a family tree is "less than" another? It means that one is a child of the other. Wouldn't it be more readable in code to just ask that question directly with is_transitive_child_of?
The more complex your comparison logic is, the less likely it is that what you're doing is really a "comparison". There's probably some more descriptive name for this "comparison" operation that would be more readable.
Oh sure, such a class could have a function that returns a partial_ordering representing the relationship between the two objects. But I wouldn't call that function operator<=>.
But in any case, is <=> a zero-overhead abstraction of comparison? No; you can construct cases where it costs significantly more to compute the ordering than it does to detect the specific relation you asked for. But personally if that's the case, there's a good chance that you shouldn't be comparing such types through operators at all.
I ran across an issue while trying to sort a vector of objects: the sort resulted in an infinite loop. I am using a custom compare function that I passed to the sort function.
I was able to fix the issue by returning false when two objects were equal instead of true, but I don't fully understand the solution. I think it's because my compare function was violating this rule, as outlined on cplusplus.com:
Comparison function object that, taking two values of the same type than those contained in the range, returns true if the first argument goes before the second argument in the specific strict weak ordering it defines, and false otherwise.
Can anyone provide a more detailed explanation?
The correct answer, as others have pointed out, is to learn what a "strict weak ordering" is. In particular, if comp(x,y) is true, then comp(y,x) has to be false. (Note that this implies that comp(x,x) is false.)
That is all you need to know to correct your problem. The sort algorithm makes no promises at all if your comparison function breaks the rules.
If you are curious what actually went wrong, your library's sort routine probably uses quicksort internally. Quicksort works by repeatedly finding a pair of "out of order" elements in the sequence and swapping them. If your comparison tells the algorithm that a,b is "out of order", and it also tells the algorithm that b,a is "out of order", then the algorithm can wind up swapping them back and forth over and over forever.
If you're looking for a detailed explanation of what 'strict weak ordering' is, here's some good reading material: Order I Say!
If you're looking for help fixing your comparison functor, you'll need to actually post it.
If the items are the same, one does not go before the other. The documentation was quite clear in stating that you should return false in that case.
The actual rule is specified in the C++ standard, in 25.3[lib.alg.sorting]/2
Compare is used as a function object which returns true if the first argument is less than the second, and false otherwise.
The case when the arguments are equal falls under "otherwise".
A sorting algorithm could easily loop because you're saying that A < B AND B < A when they're equal. Thus the algorithm might infinitely try to swap elements A and B, trying to get them in the correct order.
Strict weak ordering means the comparator behaves like a < b; when you return true for equal elements, it behaves like a <= b instead. The strict requirement is what the various sort algorithms rely on to work correctly and efficiently.
I have an unusual sorting case that my googling has turned up little on. Here are the parameters:
1) Random access container. (C++ vector)
2) Generally small vector size (less than 32 objects)
3) Many objects have "do-not-care" relationships relative to each other, but they are not equal. (i.e. they don't care which of them appears first in the final sorted vector, but they may compare differently to other objects.) To put it a third way (if it's still unclear), the comparison function for two objects can return 3 results: "order is correct," "order needs to be flipped," or "do not care."
4) Equalities are possible, but will be very rare. (This would probably just be treated like any other "do-not-care.")
5) Comparison operator is far more expensive than object movement.
6) There is no comparison speed difference between determining that objects care or don't care about each other. (i.e. I don't know of a way to make a quicker comparison that simply says whether the two objects care about each other or not.)
7) Random starting order.
Whatever you're going to do, given your conditions I'd make sure you draw up a big pile of test cases (e.g. get a few datasets and shuffle them a few thousand times), as I suspect it'd be easy to choose a sort that fails to meet your requirements.
The "do not care" is tricky as most sort algorithms depend on a strict ordering of the sort value - if A is 'less than or equal to' B, and B is 'less than or equal to' C, then it assumes that A is less than or equal to C -- in your case if A 'doesn't care' about B but does care about C, but B is less than C, then what do you return for the A-B comparison to ensure A will be compared to C?
For this reason, and it being small vectors, I'd recommend NOT using any of the built in methods as I think you'll get the wrong answers, instead I'd build a custom insertion sort.
Start with an empty target vector, insert the first item, then for each subsequent item scan the array looking for the bounds of where it can be inserted (i.e. ignoring the "do not cares", find the last item it must go after and the first it must go before) and insert it in the middle of that gap, moving everything else along the target vector (i.e. it grows by one entry each time).
[If the comparison operation is particularly expensive, you might do better to start in the middle and scan in one direction until you hit one bound, then choose whether the other bound is found moving from that bound, or the mid point... this would probably reduce the number of comparisons, but from reading what you say about your requirements you couldn't, say, use a binary search to find the right place to insert each entry]
Yes, this is basically O(n^2), but for a small array this shouldn't matter, and you can prove that the answers are right. You can then see if any other sorts do better, but unless you can return a proper ordering for any given pair then you'll get weird results...
You can't do the sorting with "don't care"; it is likely to mess up the order of the elements. Example:
list = {A, B, C};
where:
A don't care B
B > C
A < C
So even with the "don't care" between A and B, B has to be greater than A, or one of those will be false: B > C or A < C. If that can never happen, then you need to treat them as equal instead of "don't care".
What you have there is a "partial order".
If you have an easy way to figure out, for a given object, which objects' order relative to it is not "don't care", you can tackle this with basic topological sorting.
If you have a lot of "don't care"s (i.e. if you only have a sub-quadratic number of edges in your partial ordering graph), this will be a lot faster than ordinary sorting; however, if you don't, the algorithm will be quadratic!
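A sketch of that topological approach (Kahn's algorithm; the three-way "care" comparison interface is my assumption, and building the edge list still costs a comparison per pair):

#include <cstddef>
#include <queue>
#include <vector>

// Assumed result of the expensive comparison from the question:
// a goes before b, a goes after b, or the pair doesn't care.
enum class Rel { Before, After, DontCare };

// Kahn's algorithm over the "must come before" graph induced by the partial order.
template <typename T, typename Compare>
std::vector<T> topologicalSort(const std::vector<T>& items, Compare careCompare) {
    const std::size_t n = items.size();
    std::vector<std::vector<std::size_t>> successors(n);
    std::vector<std::size_t> indegree(n, 0);

    // Building the edges takes O(n^2) comparisons; fine for n < 32 as in the question.
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j) {
            Rel r = careCompare(items[i], items[j]);
            if (r == Rel::Before)     { successors[i].push_back(j); ++indegree[j]; }
            else if (r == Rel::After) { successors[j].push_back(i); ++indegree[i]; }
        }

    // Repeatedly emit any element with no remaining predecessors.
    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < n; ++i)
        if (indegree[i] == 0) ready.push(i);

    std::vector<T> sorted;
    sorted.reserve(n);
    while (!ready.empty()) {
        std::size_t i = ready.front();
        ready.pop();
        sorted.push_back(items[i]);
        for (std::size_t j : successors[i])
            if (--indegree[j] == 0) ready.push(j);
    }
    // If sorted.size() < n, the "care" relations contain a cycle and no valid order exists.
    return sorted;
}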
I believe a selection sort will work without modification, if you treat the "do-not-care" result as equal. Of course, the performance leaves something to be desired.