removing nested paths from vector of strings


I have a std::vector<std::string> paths where each entry is a path, and I want to remove all the paths that are sub-directories of another one.
If for example I have root/dir1/, root/dir1/sub_dir/ and root/dir2/, the result should be root/dir1/, root/dir2/.
The way I've implemented it is by using std::sort + std::unique with a predicate that checks if string2 starts with string1.
std::vector<std::string> paths = getPaths();
std::sort(paths.begin(), paths.end());
const auto to_erase = std::unique(paths.begin(), paths.end(), [](const std::string & s1, const std::string & s2) {
return (s2.starts_with(s1) || s1.starts_with(s2));
});
paths.erase(to_erase, paths.end());
But since the predicate should be symmetrical, I wonder whether some implementations of std::unique iterate from end to start, in which case the result would be wrong.

Your predicate is symmetric.
Let p be your predicate (the lambda), and a and b some strings, different from each other, but such that p(a, b) is true. Then either a.starts_with(b) or b.starts_with(a).
If a.starts_with(b), then p(b, a) is true because s2.starts_with(s1) is true in the lambda. Similarly, if b.starts_with(a), then p(b, a) is true because s1.starts_with(s2) is true in the lambda.
So, if p(a, b), then p(b, a) (and vice versa), which is the definition of a symmetric predicate.
It's not transitive though (p("a/b", "a") and p("a", "a/c") but not p("a/b", "a/c")), but I can't see a way this could pose a problem in practice. It could definitely lead to different results if the input isn't sorted, but yours is.
So your implementation is probably fine.
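If you would rather not lean on how std::unique treats a non-transitive predicate, the same filtering can be written as an explicit pass over the sorted vector. This is only a sketch: remove_nested and has_prefix are hypothetical names, and it assumes every directory ends with '/' as in the question, so that "root/dir1/" is not mistaken for a parent of "root/dir10/".

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// True if `s` begins with `prefix` (stand-in for C++20 std::string::starts_with).
bool has_prefix(const std::string& s, const std::string& prefix) {
    return s.size() >= prefix.size() &&
           s.compare(0, prefix.size(), prefix) == 0;
}

// Hypothetical helper: drops every path that is nested under another path.
// Assumes each directory ends with '/'.
std::vector<std::string> remove_nested(std::vector<std::string> paths) {
    std::sort(paths.begin(), paths.end());
    std::vector<std::string> kept;
    for (const std::string& path : paths) {
        // After sorting, a parent precedes all of its descendants, so only
        // the most recently kept entry can be a prefix of `path`.
        if (kept.empty() || !has_prefix(path, kept.back())) {
            kept.push_back(path);
        }
    }
    return kept;
}

// Usage sketch with the paths from the question.
void remove_nested_example() {
    const std::vector<std::string> result =
        remove_nested({"root/dir1/sub_dir/", "root/dir2/", "root/dir1/"});
    assert((result == std::vector<std::string>{"root/dir1/", "root/dir2/"}));
}
```

Exact duplicates are dropped as well, since a string is a prefix of itself.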

Related

std::algorithm to make this more expressive

I'm working on my own blockchain implementation in C++17.
To learn as much as possible, I would like to replace as many hand-written loops as I can with the alternative (and more expressive) algorithms in the std::algorithm library.
In a nutshell, within the Blockchain algorithm, each block contains its own hash and the hash of the previous block (except the Genesis-Block, the first in the container, which has no previous block).
I want to implement a Validate function in the Blockchain object that takes pairs of blocks (which are contained in a std::vector), and checks the hashes (and previous hashes) to make sure they haven't been tampered with.
Current code (works):
bool Blockchain::Validate() const
{
// HASH is in ASCII
std::array<unsigned char, 64> calculated_hash{};
// No underflow here since Genesis-Block is created in the CTOR
for (auto index = this->chain_.size() - 1; index > 0; --index)
{
const auto& current_block{this->chain_[index]};
const auto& previous_block{this->chain_[index - 1]};
SHA256{previous_block}.FillHash(calculated_hash);
if (!std::equal(std::begin(calculated_hash), std::end(calculated_hash),
std::begin(current_block.GetPreviousHash()), std::end(current_block.GetPreviousHash())))
{
return false;
}
}
return true;
}
I would like to know if there's an algorithm that works somehow the way Python does its ", ".join(arr) for strings, which appends commas between each adjacent pair in the array, but instead will check until a certain condition returns false in which case stops running.
TLDR:
If this is my container:
A B C D E F G
I would like to know if there's an algorithm that asserts a condition in adjacent pairs: (A, B), (B, C), (C, D), (D, E), (E, F), (F, G)
And will stop if a condition has failed, for example:
A with B -> True
B with C -> True
C with D -> False
So the algorithm will return false. (Sounds like an adjacent implementation of std::all_of).
Does a std::algorithm like this exist? Thanks!

If you have some range v where you want to check each adjacent element for some condition, and return early, you can use std::adjacent_find.
First write a lambda that compares adjacent elements:
auto comp = [](const auto& left, const auto& right)
{
return /* the negation of the actual condition */;
};
Note that the negation is needed, so that you return early when you reach the actual false case. So in your case, A,B and B,C would compare false, and C,D would compare true.
Now you can use the lambda like this:
return std::adjacent_find(std::begin(v), std::end(v), comp) == std::end(v);
In your manual loop, you actually appear to be iterating in reverse, in which case you can write:
return std::adjacent_find(std::rbegin(v), std::rend(v), comp) == std::rend(v);
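To make the negation concrete, here is a minimal sketch of the pairwise check. Block and links_are_valid are hypothetical stand-ins, and plain integers replace the real SHA-256 hashes; the point is only how the "broken link" predicate combines with std::adjacent_find.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for a block: plain integers replace the real hashes.
struct Block {
    int hash;
    int previous_hash;
};

// True when every block's previous_hash matches its predecessor's hash.
bool links_are_valid(const std::vector<Block>& chain) {
    // The predicate describes a *broken* link, so adjacent_find stops at the
    // first failing pair and returns end() when no pair fails.
    const auto broken = std::adjacent_find(
        chain.begin(), chain.end(),
        [](const Block& previous, const Block& current) {
            return current.previous_hash != previous.hash;
        });
    return broken == chain.end();
}

// Usage sketch: one intact chain and one with a tampered middle block.
void links_example() {
    const std::vector<Block> intact{{10, 0}, {20, 10}, {30, 20}};
    const std::vector<Block> tampered{{10, 0}, {20, 99}, {30, 20}};
    assert(links_are_valid(intact));
    assert(!links_are_valid(tampered));
}
```

An empty or single-element chain has no adjacent pairs, so the function returns true for both.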

How does std::set comparator function work?

Currently working on an algorithm problem using a set.
set<string> mySet;
mySet.insert("(())()");
mySet.insert("()()()");
//print mySet:
(())()
()()()
Ok great, as expected.
However if I put a comp function that sorts the set by its length, I only get 1 result back.
struct size_comp
{
bool operator()(const string& a, const string& b) const{
return a.size()>b.size();
}
};
set<string, size_comp> mySet;
mySet.insert("(())()");
mySet.insert("()()()");
//print myset
(())()
Can someone explain to me why?
I tried using a multiset, but it's appending duplicates.
multiset<string,size_comp> mSet;
mSet.insert("(())()");
mSet.insert("()()()");
mSet.insert("()()()");
//print mset
"(())()","()()()","()()()"

std::set stores unique values only. Two values a,b are considered equivalent if and only if
!comp(a,b) && !comp(b,a)
or in everyday language, if a is not smaller than b and b is not smaller than a. In particular, only this criterion is used to check for equality, the normal operator== is not considered at all.
So with your comparator, the set can only contain one string of length n for every n.
If you want to allow multiple values that are equivalent under your comparison, use std::multiset. This will of course also allow exact duplicates: under your comparator, "asdf" is just as equivalent to "aaaa" as it is to "asdf".
If that does not make sense for your problem, you need to come up with either a different comparator that induces a proper notion of equality or use another data structure.
A quick fix to get the behavior you probably want (correct me if I'm wrong) would be introducing a secondary comparison criterion like the normal operator>. That way, we sort by length first, but are still able to distinguish between different strings of the same length.
struct size_comp
{
bool operator()(const string& a, const string& b) const{
if (a.size() != b.size())
return a.size() > b.size();
return a > b;
}
};
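The difference between the two comparators can be checked directly. The sketch below uses hypothetical names (by_size_only, by_size_then_content) so it does not clash with the size_comp definitions above.

```cpp
#include <cassert>
#include <set>
#include <string>

// Orders by length only: equal-length strings are equivalent to the set.
struct by_size_only {
    bool operator()(const std::string& a, const std::string& b) const {
        return a.size() > b.size();
    }
};

// Orders by length, then by content, so only identical strings are equivalent.
struct by_size_then_content {
    bool operator()(const std::string& a, const std::string& b) const {
        if (a.size() != b.size()) return a.size() > b.size();
        return a > b;
    }
};

void comparator_example() {
    std::set<std::string, by_size_only> collapsed;
    collapsed.insert("(())()");
    const bool inserted = collapsed.insert("()()()").second;
    assert(!inserted);                       // same length, so rejected
    assert(collapsed.size() == 1);
    assert(*collapsed.begin() == "(())()");  // the first element is kept

    std::set<std::string, by_size_then_content> distinct;
    distinct.insert("(())()");
    distinct.insert("()()()");
    distinct.insert("()()()");               // exact duplicate, still rejected
    assert(distinct.size() == 2);
}
```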

The comparator template argument, which defaults to std::less<T>, must represent a strict weak ordering relation between values in its domain.
This kind of relation has some requirements:
it's irreflexive (x < x is always false)
it's asymmetric (x < y implies that y < x is false)
it's transitive (x < y && y < z implies x < z)
Taking this further, we can define equivalence between values in terms of this relation: if !(x < y) && !(y < x), then x and y are considered equivalent (which need not mean x == y).
In your situation, for all x and y such that x.size() == y.size(), both comp(x, y) and comp(y, x) are false; since neither is less than the other, they are treated as equivalent.
The set uses this equivalence to decide whether two items are the same element, which is why the second insertion in your example is ignored.
To fix this you must make sure that your comparator never returns false for both comp(x,y) and comp(y,x) if you don't want to consider x equal to y, for example by doing
auto cmp = [](const string& a, const string& b) {
if (a.size() != b.size())
return a.size() > b.size();
else
return std::less<string>()(a, b);
};
So that for inputs of the same length you fall back to normal lexicographic order.

This is because equality of elements is defined by the comparator. An element is considered equal to another if and only if !comp(a, b) && !comp(b, a).
Since the length of "(())()" is neither greater nor less than the length of "()()()", they are considered equal by your comparator. A std::set holds only unique elements, so inserting an equivalent object is rejected and the existing element stays in place.
The default comparator uses operator<, which in the case of strings, performs lexicographical ordering.
"I tried using a multiset, but it's appending duplicates."
Multiset indeed does allow duplicates. Therefore both strings will be contained despite having the same length.

size_comp considers only the length of the strings. The default comparison operator uses lexicographic comparison, which distinguishes based on the content of the string as well as the length.

Is there a standard way to compare two ranges using a predicate?

Given...
string a; // = something.
string b; // = something else. The two strings are of equal length.
string::size_type score = 0;
...what I would like to do is something like...
compare(a.cbegin(), a.cend(), b.cbegin(), b.cend(), [&score](const char c1, const char c2) -> void {
if (c1 == c2) { // actually a bit more complicated in real life
score++;
}
});
...but as far as I can tell there doesn't seem to be a std::compare. The nearest seems to be std::lexicographical_compare but that doesn't quite match. Ditto for std::equal. Is there really nothing appropriate in the standard library? I suppose I could write my own (or use a plain old C style loop which is what I did but how boring :-) but I would think what I'm doing is rather common so that would be a strange omission IMO. So my question is am I missing something?

Is there a standard algorithm to compare two ranges using a predicate? Yes, std::equal, or std::lexicographical_compare.
Is there a standard algorithm to do what your code is doing? std::inner_product can be made to do it:
std::string a = "something";
std::string b = "samething";
auto score = std::inner_product(
a.begin(), a.end(), b.begin(), 0,
[](int x, bool b) { return x + b; },
[](char a, char b) { return a == b; });
"but I would think what I'm doing is rather common"
No, not really. If you just want to run a general function over corresponding elements in two ranges, the appropriate algorithm would be for_each with a zip iterator. If anything's missing from the standard, it's the zip iterator. We don't need a special algorithm for this purpose.

It looks a bit as if you are looking for std::mismatch() which yields the iterators where the first difference is found (or the end, of course). It doesn't compute the difference, however, because there isn't a subtraction defined for all types. Like the other algorithms std::mismatch() comes in a form with a predicate and one without a predicate.
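As a rough illustration of the predicate form, the sketch below reports the index of the first position where two strings differ when case is ignored. The function name and the case-insensitive predicate are assumptions chosen for the example, not something from the question.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Index of the first position where `a` and `b` differ, ignoring case.
// Returns the length of the shorter string when no difference is found.
std::size_t first_difference_ignoring_case(const std::string& a,
                                           const std::string& b) {
    const auto same = [](unsigned char x, unsigned char y) {
        return std::tolower(x) == std::tolower(y);
    };
    // Four-iterator overload (C++14): stops at the end of the shorter range.
    const auto where =
        std::mismatch(a.begin(), a.end(), b.begin(), b.end(), same);
    return static_cast<std::size_t>(where.first - a.begin());
}

void mismatch_example() {
    assert(first_difference_ignoring_case("something", "samething") == 1);
    assert(first_difference_ignoring_case("Something", "sOMETHING") == 9);
}
```

Unlike the inner_product version, this stops at the first difference, so it locates a mismatch rather than counting matches.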

Thank you to all who answered. What I was trying to do (more for my edification than anything else really) was to replace this...
for (string::const_iterator c1 = a.begin(), c2 = b.begin(); c1 != a.end(); ++c1, ++c2) {
if (*c1 == *c2) {
score++;
}
}
...with snazzy new C++11 stuff :-) I looked at equal, lexicographical_compare etc. but I guess what tripped me up was that they take a boolean predicate and if it returns false processing stops, whereas I needed to process the entire ranges each time. Then after reading the answers you gave me I had the epiphany that just because there is a return value doesn't mean I can't throw it away if I don't need it. By simply always returning true in my lambda I can use std::equal or std::mismatch and they will run to the end of the range (std::lexicographical_compare is the exception: it returns as soon as its predicate yields true).
The only thing is that I would be using the algorithms in a different way than their names suggest, which might cause maintenance problems in the future, so I will just stick to my boring old loop for now. But I learned something new, so thanks once again.

How to sort a vector of structs based on a vector<string> within the vector to be sorted?

What is the best way to sort a vector of structures alphabetically by the first word of the vector<string> that each structure contains?
struct sentence{
vector<string> words;
};
vector<sentence> allSentences;
In other words, how to sort allSentences based on words[0]?
EDIT: I used the following solution:
bool cmp(const sentence& lhs, const sentence & rhs)
{
return lhs.words[0] < rhs.words[0];
}
std::sort(allSentences.begin(), allSentences.end(), cmp);

Provide a suitable comparison binary function and pass it on to std::sort. For example
bool cmp(const sentence& lhs, const sentence & rhs)
{
return lhs.words[0] < rhs.words[0];
}
then
std::sort(allSentences.begin(), allSentences.end(), cmp);
Alternatively, in C++11 you can use a lambda anonymous function
std::sort(allSentences.begin(), allSentences.end(),
[](const sentence& lhs, const sentence & rhs) {
return lhs.words[0] < rhs.words[0];}
);

You need some comparison function that you can pass to std::sort:
bool compare(const sentence& a, const sentence& b)
{
return a.words[0] < b.words[0];
}
As you can see, it takes two sentences and returns true if the first word of the first sentence is "less than" the first word of the second sentence.
Then you can sort allSentences very easily:
std::sort(allSentences.begin(), allSentences.end(), compare);
Of course, using this comparison means that sentences like {"hello", "world"} and {"hello", "friend"} will compare equal. But that's what you've asked for.

Generally, there are three different types of scenarios for comparison implementations you should consider.
A comparison of your object that always makes sense, independent of the scenario in which you want to compare objects. Then: implement operator< for your class. This operator is used whenever two objects are compared (with <, which the standard algorithms do by default). (For single scenarios, you can still "override" this behavior using the other methods below.)
For this, extend your class with the following function:
struct sentence{
vector<string> words;
bool operator<(const sentence &other) const {
return this->words[0] < other.words[0];
}
};
Then, just call the standard sorting algorithm on your vector of sentences without other arguments:
std::sort(allSentences.begin(), allSentences.end());
However, your scenario doesn't sound like this is the best method, since comparing by the first word is something you don't want to have always, maybe only in one case.
A comparison of your object which will be used only once. In C++11, you have lambda functions (anonymous, literally inlined functions), which can be passed directly to the algorithm function in which it will be used, like std::sort in this scenario. This is my favorite solution:
// Sort lexicographically by first word
std::sort(allSentences.begin(), allSentences.end(),
[](const sentence& a, const sentence& b) {
return a.words[0] < b.words[0];
});
In C++03, where you don't have lambdas, use the third solution:
A set of different, re-usable comparison methods, maybe a parameterized comparison function. Examples are: Compare by the first word, compare by length, compare by something else... In this case, implement the comparison function(s) either as free-standing functions and use function pointers, or implement them as functors (which can be parameterized). Also, lambdas stored in variables do the job in this case.
This method has the advantage of naming the comparison methods, giving them a meaning. If you use different comparisons for the same object, but re-use them, this is a huge advantage:
// Lexicographical comparison by the first word only
bool compareLexByFirstWord(const sentence& a, const sentence& b) {
return a.words[0] < b.words[0];
}
// Lexicographical comparison by all words
bool compareLex(const sentence& a, const sentence& b) {
return a.words < b.words;
}
// Decide which behavior to use when actually using the comparison:
std::sort(allSentences.begin(), allSentences.end(), compareLexByFirstWord);
std::sort(allSentences.begin(), allSentences.end(), compareLex);
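All of the comparators above index words[0], which is undefined behavior for a sentence with no words. A possible guard is sketched below; first_word and sort_by_first_word are hypothetical helpers, and sorting empty sentences first is an assumption made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct sentence {
    std::vector<std::string> words;
};

// Hypothetical helper: the first word, or an empty string for an empty
// sentence, so the comparator never indexes into an empty vector.
const std::string& first_word(const sentence& s) {
    static const std::string none;
    return s.words.empty() ? none : s.words.front();
}

// Sorts by first word; sentences without words end up at the front.
void sort_by_first_word(std::vector<sentence>& sentences) {
    std::sort(sentences.begin(), sentences.end(),
              [](const sentence& a, const sentence& b) {
                  return first_word(a) < first_word(b);
              });
}

void sort_example() {
    std::vector<sentence> all(3);  // all[1] deliberately stays empty
    all[0].words = {"world", "hello"};
    all[2].words = {"apple"};
    sort_by_first_word(all);
    assert(all[0].words.empty());
    assert(all[1].words.front() == "apple");
    assert(all[2].words.front() == "world");
}
```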

STL algorithm for merge with addition

I was using std::merge to put two sorted collections into one.
But my object has a natural key; and a defined addition semantic, so what I am after is a merge_and_sum that would not just merge the two collections into a single N+M length collection, but if the operator== on the object returned true, would then operator+ them.
I have implemented it thus
template<class _InIt1, class _InIt2, class _OutIt>
_OutIt merge_and_sum(_InIt1 _First1, _InIt1 _Last1, _InIt2 _First2, _InIt2 _Last2, _OutIt _Dest )
{ // copy merging ranges, both using operator<
for (; _First1 != _Last1 && _First2 != _Last2; ++_Dest)
{
if ( *_First2 < *_First1 )
*_Dest = *_First2, ++_First2;
else if ( *_First2 == *_First1)
*_Dest = *_First2 + *_First1, ++_First1, ++_First2;
else
*_Dest = *_First1, ++_First1;
}
_Dest = copy(_First1, _Last1, _Dest); // copy any tail
return (copy(_First2, _Last2, _Dest));
}
But was wondering if I have reinvented something that is composable from the other algorithms.

It sounds like your collections are like multisets with duplicates collapsed by your + operator (maybe just summing the multiplicities instead of keeping redundant copies). I assume so, because you're not changing the sorting order when you +, so + isn't affecting your key.
You should use your implementation. There's nothing in STL that will do it as efficiently. The closest semantic I can think of is standard merge followed by unique_copy. You could almost get unique_copy to work with a side-effectful comparison operator, but that would be extremely ill advised, as the implementation doesn't promise to only compare things directly vs. via a value-copied temporary (or even a given number of times).
Your type and variable names are unpleasantly long ;)

You could use std::merge with an output iterator of your own creation, which does the following in operator=. I think this ends up making more calls to operator== than your version, though, so unless it works out as less code it's probably not worth it.
if ((mylist.size() > 0) && (newvalue == mylist.back())) {
mylist.back() += newvalue;
} else {
mylist.push_back(newvalue);
}
(Actually, writing a proper output iterator might be more fiddly than that, I can't remember. But I hope you get the general idea).
mylist is a reference to the collection you're merging into. If the target doesn't have back(), then you'll have to buffer one value in the output iterator, and only write it once you see a non-equal value. Then define a flush function on the output iterator to write the last value, and call it at the end. I'm pretty sure that in this case it is too much mess to beat what you've already done.
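For what it's worth, a bare-bones version of such an output iterator might look like the sketch below. Entry and summing_back_inserter are hypothetical names, the target is assumed to offer empty(), back() and push_back(), and the element type is assumed to provide operator== and operator+=.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical record type: `key` orders and identifies, `count` is summed.
struct Entry {
    int key;
    int count;
};
bool operator<(const Entry& a, const Entry& b) { return a.key < b.key; }
bool operator==(const Entry& a, const Entry& b) { return a.key == b.key; }
Entry& operator+=(Entry& a, const Entry& b) { a.count += b.count; return a; }

// Minimal output iterator: assigning a value either folds it into back()
// or appends it to the target container.
template <class Container>
class summing_back_inserter {
public:
    using iterator_category = std::output_iterator_tag;
    using value_type = void;
    using difference_type = std::ptrdiff_t;
    using pointer = void;
    using reference = void;

    explicit summing_back_inserter(Container& c) : target_(&c) {}

    summing_back_inserter& operator=(const typename Container::value_type& v) {
        if (!target_->empty() && target_->back() == v) {
            target_->back() += v;
        } else {
            target_->push_back(v);
        }
        return *this;
    }
    // Dereference and increment are no-ops, as with std::back_insert_iterator.
    summing_back_inserter& operator*() { return *this; }
    summing_back_inserter& operator++() { return *this; }
    summing_back_inserter operator++(int) { return *this; }

private:
    Container* target_;
};

void merge_example() {
    const std::vector<Entry> a{{1, 1}, {3, 1}};
    const std::vector<Entry> b{{1, 2}, {2, 5}};
    std::vector<Entry> out;
    std::merge(a.begin(), a.end(), b.begin(), b.end(),
               summing_back_inserter<std::vector<Entry>>(out));
    assert(out.size() == 3);
    assert(out[0].key == 1 && out[0].count == 3);  // the two key-1 entries summed
}
```

Because it only ever looks at back(), this also collapses equal elements that sit next to each other within a single input range.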

Well, your other option would be to use set_symmetric_difference to get the elements that were different, then use set_intersection to get the ones that are the same, but twice. Then add them together and insert into the first.
typedef set<MyType, MyComp> SetType;
SetType merge_and_add(const SetType& s1, const SetType& s2)
{
SetType diff;
set_symmetric_difference(s1.begin(), s1.end(), s2.begin(), s2.end(), inserter(diff, diff.end()));
vector<SetType::value_type> same1, same2;
set_intersection(s1.begin(), s1.end(), s2.begin(), s2.end(), back_inserter(same1));
set_intersection(s2.begin(), s2.end(), s1.begin(), s1.end(), back_inserter(same2));
transform(same1.begin(), same1.end(), same2.begin(), inserter(diff, diff.begin()), plus<SetType::value_type>());
return diff;
}
Side note! You should stick to either operator==, in which case you should use an unordered_set (which also needs a hash function), or operator< for a regular set. A set requires a strict weak ordering, which means 2 entries are deemed equivalent if !(a < b) && !(b < a). So even if your two objects are unequal by operator==, if they satisfy this condition the set will consider them duplicates. So for your function supplied above I highly recommend refraining from using an == comparison.
