Intersection Between Two String Sets with Substring Compare - c++

I know this is bikeshedding, but given two (sorted) sets of strings A and B, is there a way to get the set C of strings from A that contain some string from B as a substring, with a complexity better than A.size * B.size * comp_substr, which is what the naive solution I came up with costs?
std::copy_if(devices.cbegin(), devices.cend(),
    std::back_inserter(ports),
    [&comport_keys] (const auto& v) {
        return std::any_of(comport_keys.begin(), comport_keys.end(), [&v](const auto& k) {
            return v.find(k) != std::string::npos;
        });
    });
The easier case, where an element of B must equal an element of A, would be pretty simple with std::set_intersection, with a complexity of (A.size + B.size) * comp_substr, which would still be an improvement even if one had to sort first (n * log(n)). But I don't know how to write the compare function for it, or rather how to sort both ranges accordingly.
#include <boost/test/included/unit_test.hpp>
#include <vector>
#include <string>
#include <algorithm>
#include <iterator>
#include <set>
std::vector<std::string> devices{
}, ports{};
const std::set<std::string> comport_keys{
std::sort(devices.begin(), devices.end());
std::set_intersection(devices.cbegin(), devices.cend(),
comport_keys.cbegin(), comport_keys.cend(),
[&comport_keys] (auto a, auto b) {
return a.find(b) != std::string::npos; //This is wrong
const std::vector<std::string>test_set {
BOOST_TEST(ports == test_set);

Say we have two sets of strings: A and B. B contains a set of potential prefixes for the strings in A. So we want to take each element a from A and try to match it with all potential prefixes of B.
If we find a matching prefix, we store our result a in C. The trivial solution works in O(|A| |B|). You ask: Can we optimize this?
You said B is already sorted. Then we can build a generalised prefix tree on B in linear time and query it with each string in A to solve the problem in O(|A|+|B|). The problem is that sorting B takes O(|B| log|B|) and the tree is non-trivial.
So I provide a simple solution with O(|A| log|B|) which is more efficient than O(|A|+|B|) if |A| is small, like in your example. B is still assumed to be sorted (the sorting is really the upper bound here...).
bool validate_prefixes(const std::multiset<std::string>& keys) {
    auto itb = keys.begin(), it = itb;
    if(it == keys.end()) return false; //no keys
    for(++it; it != keys.end(); ++it, ++itb) { //compare each key with its sorted predecessor
        if( (*it).find(*itb) != std::string::npos ) return false; //redundant keys
    }
    return true;
}
bool copy_from_intersecting_prefixes(const std::vector<std::string>& data,
        std::multiset<std::string>& prefix_keys,
        std::vector<std::string>& dest, bool check = false) {
    if(check && !validate_prefixes(prefix_keys)) return false;
    for(auto it_data = data.begin(); it_data != data.end(); ++it_data) {
        auto ptr = prefix_keys.insert(*it_data), ptrb = ptr;
        if(ptrb != prefix_keys.begin()) { //if data is at the start, there is no prefix
            if( (*ptr).find(*(--ptrb)) != std::string::npos ) dest.push_back(*it_data);
        }
        prefix_keys.erase(ptr); //remove the probe again so it is never mistaken for a key
    } //Complexity: O(|data|) * O( log(|prefix_keys|) ) * O(substr) = loop*insert*find
    return check;
}
//.... in main()
std::multiset<std::string> tmp(comport_keys.begin(), comport_keys.end()); //copy const
copy_from_intersecting_prefixes(devices, tmp, ports);
validate_prefixes enforces a precondition. It checks if we have at least one valid prefix and that the keys are not self-matching. E.g. we could have keys cu and cu2, but cu is a prefix for cu2, so they can't be both valid prefixes, either cu is too general or cu2 too specific. If we try to match cu3 with cu and cu2 this is inconsistent. Here validate_prefixes(comport_keys) returns true, but it might be nice to check it automatically.
copy_from_intersecting_prefixes does the actual work. It iterates over A and inserts each a into the ordered B. A prefix is smaller than prefix+ending, so if a corresponding prefix exists, it occurs before a in B. Because the keys are not self-matching, no other key can sit between the prefix and a, so the prefix directly precedes a in B. So we decrement the iterator from a and compare. Note that the prefix might equal a, so we need a multiset.
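The same predecessor idea can also be sketched without inserting anything into the key set, by binary-searching a sorted vector of keys. This is only an illustration under the same precondition (keys sorted, no key is a prefix of another); the name match_prefixes is made up here, and it tests for a true prefix rather than a substring.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Returns every element of `data` that starts with one of `sorted_keys`.
// Assumption: `sorted_keys` is sorted and no key is a prefix of another key.
std::vector<std::string> match_prefixes(const std::vector<std::string>& data,
                                        const std::vector<std::string>& sorted_keys)
{
    std::vector<std::string> out;
    for (const auto& a : data) {
        // First key strictly greater than `a`; its predecessor is the only candidate.
        auto it = std::upper_bound(sorted_keys.begin(), sorted_keys.end(), a);
        if (it == sorted_keys.begin())
            continue; // every key sorts after `a`, so none can be its prefix
        const std::string& k = *std::prev(it);
        if (a.compare(0, k.size(), k) == 0) // does `a` start with `k`?
            out.push_back(a);
    }
    return out;
}
```

Each lookup costs one binary search plus one prefix comparison, so the total stays at O(|A| log|B|) string comparisons, and the key container is never modified.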


fast way to compare two vector containing strings

I have a vector of strings that I pass to my function, and I need to compare it with some pre-defined values. What is the fastest way to do this?
The following code snippet shows what I need to do (This is how I am doing it, but what is the fastest way of doing this):
bool compare(vector<string> input1,vector<string> input2)
{
    if(input1.size() != input2.size())
        return false;
    for(int i=0;i<input1.size();i++)
        if(input1[i] != input2[i])
            return false;
    return true;
}
int compare(vector<string> inputData)
{
    if (compare(inputData,{"Apple","Orange","three"}))
        return 129;
    if (compare(inputData,{"A","B","CCC"}))
        return 189;
    if (compare(inputData,{"s","O","quick"}))
        return 126;
    if (compare(inputData,{"Apple","O123","three","four","five","six"}))
        return 876;
    if (compare(inputData,{"Apple","iuyt","asde","qwe","asdr"}))
        return 234;
    return 0;
}
Can I compare two vectors like this:
if (inputData == {"Apple","Orange","three"})
    return 129;
You are asking what is the fastest way to do this, and you are indicating that you are comparing against a set of fixed and known strings. I would argue that you would probably have to implement it as a kind of state machine. Not that this is very beautiful...
if (inputData.size() != 3) return 0;
if (inputData[0].size() == 0) return 0;
const char inputData_0_0 = inputData[0][0];
if (inputData_0_0 == 'A') {
    // possibly "Apple" or "A"
} else if (inputData_0_0 == 's') {
    // possibly "s"
} else {
    return 0;
}
The weakness of your approach is its linearity. You want a binary search for teh speedz.
By utilising the sortedness of a map, the binaryness of finding in one, and the fact that equivalence between vectors is already defined for you (no need for that first compare function!), you can do this quite easily:
std::map<std::vector<std::string>, int> lookup{
    {{"Apple","Orange","three"}, 129},
    {{"A","B","CCC"}, 189},
    // ...
};
int compare(const std::vector<std::string>& inputData)
{
    auto it = lookup.find(inputData);
    if (it != lookup.end())
        return it->second;
    return 0;
}
Note also the reference passing for extra teh speedz.
(I haven't tested this for exact syntax-correctness, but you get the idea.)
However! As always, we need to be context-aware in our designs. This sort of approach is more useful at larger scale. At the moment you only have a few options, so the addition of some dynamic allocation and sorting and all that jazz may actually slow things down. Ultimately, you will want to take my solution, and your solution, and measure the results for typical inputs and whatnot.
Once you've done that, if you still need more speed for some reason, consider looking at ways to reduce the dynamic allocations inherent in both the vectors and the strings themselves.
To answer your follow-up question: almost; you do need to specify the type:
// new code is here
// ||||||||||||||||||||||||
if (inputData == std::vector<std::string>{"Apple","Orange","three"})
return 129;
As explored above, though, let std::map::find do this for you instead. It's better at it.
One key to efficiency is eliminating needless allocation.
Thus, it becomes:
bool compare(
    std::vector<std::string> const& a,
    std::initializer_list<const char*> b
) noexcept {
    return std::equal(begin(a), end(a), begin(b), end(b));
}
Alternatively, make them static const, and accept the slight overhead.
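As an aside, C++17 std::string_view (or the Boost equivalent) and C++20 std::span (or the one from the Guidelines Support Library (GSL)) also allow a nicer alternative:
bool compare(std::span<const std::string> a, std::span<const std::string_view> b) noexcept {
    // std::span has no operator==, so compare element-wise
    return std::equal(a.begin(), a.end(), b.begin(), b.end());
}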
The other is minimizing the number of comparisons. You can either use hashing, binary search, or manual ordering of comparisons.
Unfortunately, transparent comparators are a C++14 thing, so you cannot use std::map.
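The hashing option could look roughly like the following. This is just a sketch: VectorHash and lookup_code are hypothetical names, the mixing step is borrowed from the well-known boost::hash_combine recipe, and only three of the question's patterns are listed.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical hash for a vector of strings: folds the element hashes together.
struct VectorHash {
    std::size_t operator()(const std::vector<std::string>& v) const noexcept {
        std::size_t seed = v.size();
        for (const auto& s : v) {
            // Mixing step in the style of boost::hash_combine.
            seed ^= std::hash<std::string>{}(s) + 0x9e3779b9u + (seed << 6) + (seed >> 2);
        }
        return seed;
    }
};

// Returns the code registered for `input`, or 0 when there is no match.
int lookup_code(const std::vector<std::string>& input)
{
    // Built once on first use; later calls only hash the input and probe the table.
    static const std::unordered_map<std::vector<std::string>, int, VectorHash> table{
        {{"Apple", "Orange", "three"}, 129},
        {{"A", "B", "CCC"}, 189},
        {{"s", "O", "quick"}, 126},
    };
    const auto it = table.find(input);
    return it != table.end() ? it->second : 0;
}
```

Hashing still has to read every character of the input once, so for a handful of patterns it may well lose to the plain if-chain; measure before committing to it.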
If you want a fast way to do it where the vectors to compare to are not known in advance, but are reused so can have a little initial run-time overhead, you can build a tree structure similar to the compile time version Dirk Herrmann has. This will run in O(n) by just iterating over the input and following a tree.
In the simplest case, you might build a tree for each letter/element. A partial implementation could be:
typedef std::vector<std::string> Vector;
typedef Vector::const_iterator Iterator;
typedef std::string::const_iterator StrIterator;
struct Node
{
    std::unique_ptr<Node> children[256];
    std::unique_ptr<Node> new_str_child;
    int result = 0;
    bool is_result = false;
};
Node root;
int compare(Iterator vec_it, Iterator vec_end, StrIterator str_it, StrIterator str_end, const Node *node);
int compare(const Vector &input)
{
    if (input.empty()) return -1; // input.front() is undefined on an empty vector
    return compare(input.begin(), input.end(), input.front().begin(), input.front().end(), &root);
}
int compare(Iterator vec_it, Iterator vec_end, StrIterator str_it, StrIterator str_end, const Node *node)
{
    if (str_it != str_end)
    {
        // Check next character
        auto next_child = node->children[(unsigned char)*str_it].get();
        if (next_child)
            return compare(vec_it, vec_end, str_it + 1, str_end, next_child);
        else return -1; // No string matched
    }
    // At end of input string: move on to the next string in the vector
    ++vec_it;
    if (vec_it != vec_end)
    {
        auto next_child = node->new_str_child.get();
        if (next_child)
            return compare(vec_it, vec_end, vec_it->begin(), vec_it->end(), next_child);
        else return -1; // Have another string, but not in tree
    }
    // At end of input vector
    if (node->is_result)
        return node->result; // Got a match
    else return -1; // Run out of input, but all possible matches were longer
}
Which can also be done without recursion. For use cases like yours you will find most nodes only have a single success value, so you can collapse those into prefix substrings, to use the OP example:
|-"pple" - new vector - "O" - "range" - new vector - "three" - ret 129
| |- "i" - "uyt" - new vector - "asde" ... - ret 234
| |- "0" - "123" - new vector - "three" ... - ret 876
|- new vector "B" - new vector - "CCC" - ret 189
"s" - new vector "O" - new vector "quick" - ret 126
You could make use of the std::equal function, like below:
bool compare(vector<string> input1,vector<string> input2)
{
    if(input1.size() != input2.size())
        return false;
    return std::equal(input1.begin(), input1.end(), input2.begin());
}
Can I compare two vector like this
The answer is no; you need to compare a vector with another vector, like this:
vector<string>data = {"ab", "cd", "ef"};
if(data == vector<string>{"ab", "cd", "efg"})
    cout << "Equal" << endl;
else
    cout << "Not Equal" << endl;
What is the fastest way to do this?
I'm not an expert in asymptotic analysis, but:
Using the equality operator (==) gives you a shortcut to compare two vectors: it first checks the sizes and then compares the elements pairwise. That is linear in the number of elements (T(n), where n is the size of the vector), but each element is a string, and comparing two strings is generally linear too (T(m), where m is the size of the string).
Suppose that each string has the same size (m) and you have a vector of size n; then the whole comparison could take T(nm).
So if you want a shortcut to compare two vectors, you can use the equality operator.
If you want a program that performs a faster comparison, you should look for a string-comparison algorithm.

Search for a variable from vector of objects

struct ABC
{
    int a;
    string b;
};
I have a vector of objects of the above struct, and I want to search the vector based on the variable "b".
I have logic as below.
vector<ABC> vec = ...;//vec has my objects
for(vector<ABC>::iterator it = vec.begin();
    it != vec.end();
    ++it)
{
    //search_str is the string I need to search for
    if(search_str == (it->b)) { /* match found */ }
}
I have extensively tested the above code and it works. I want to know if there is a better way to achieve this. Maybe using find().
Simple, readable, lifted from Sam's comment:
auto found = std::find_if(vec.begin(), vec.end(), [&](auto const &e) {
    return e.b == search_str;
});
And now found is an iterator to the first matching element, or vec.end() if none was found.
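For completeness, here is a small self-contained version of that idea. The helper name find_by_b is made up for this sketch, and it returns a pointer rather than an iterator purely so that "not found" is easy to test for.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct ABC {
    int a;
    std::string b;
};

// Returns a pointer to the first element whose `b` equals `search_str`,
// or nullptr when there is none.
const ABC* find_by_b(const std::vector<ABC>& vec, const std::string& search_str)
{
    const auto found = std::find_if(vec.begin(), vec.end(),
                                    [&](const ABC& e) { return e.b == search_str; });
    return found != vec.end() ? &*found : nullptr;
}
```

The search is still linear, exactly like the hand-written loop; what you gain is that the intent is visible at a glance.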
You can also use a range-based for loop in some cases, which gives you much clearer code.
for (auto const &p : vec)
{
    if (p.b == search_str)
    {
        //--- Handle the find ---//
        //if you want to stop...
        break;
    }
}
Another way to compare two strings is the compare method of std::string.
Suppose you want to compare two strings S1 and S2. You can use the equality operator (==), as you already do.
But the std::string::compare() function has its own benefit.
It not only tells us whether two strings are equal but also whether one is less or greater than the other.
std::string::compare() returns an int:
zero if S1 is equal to S2.
less than zero if S1 is less than S2.
greater than zero if S1 is greater than S2.
So your code can be written as:
vector<ABC> vec = ...;//vec has my objects
for(vector<ABC>::iterator it = vec.begin(); it != vec.end(); ++it){
    if(search_str.compare(it->b) == 0){
        //match found
    }
}

How to cut off parts of a string, which every string in a collection has

My current problem is the following:
I have a std::vector of full path names to files.
Now I want to cut off the common prefix of all the strings.
If I have these 3 strings in the vector:
I would like to cut off /home/ from every string in the vector.
Is there any method to achieve this in general?
I want an algorithm that drops the common prefix of all string.
I currently only have an idea which solves this problem in O(n m) with n strings and m is the longest string length, by just going through every string with every other string char by char.
Is there a faster or more elegant way solving this?
This can be done entirely with std:: algorithms.
1. Sort the input range if it is not already sorted. The first and last paths in the sorted range will be the most dissimilar. Best case is O(N), worst case O(N + N.logN).
2. Use std::mismatch to determine the largest common sequence between the two most dissimilar paths [insignificant].
3. Run through each path, erasing the first COUNT characters, where COUNT is the number of characters in the longest common sequence. O(N)
Best case time complexity: O(2N), worst case O(2N + N.logN) (can someone check that?)
#include <iostream>
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>
std::string common_substring(const std::string& l, const std::string& r)
{
    return std::string(l.begin(),
                       std::mismatch(l.begin(), l.end(),
                                     r.begin(), r.end()).first);
}
std::string mutating_common_substring(std::vector<std::string>& range)
{
    if (range.empty())
        return std::string();
    if (not std::is_sorted(range.begin(), range.end()))
        std::sort(range.begin(), range.end());
    return common_substring(range.front(), range.back());
}
std::vector<std::string> chop(std::vector<std::string> samples)
{
    auto str = mutating_common_substring(samples);
    for (auto& s : samples)
        s.erase(s.begin(), std::next(s.begin(), str.size()));
    return samples;
}
int main()
{
    std::vector<std::string> samples = {
        // ...
    };
    samples = chop(std::move(samples));
    for (auto& s : samples)
        std::cout << s << std::endl;
}
Here's an alternate common_substring which does not require a sort. Time complexity is in theory O(N), but whether it's faster in practice you'd have to check:
std::string common_substring(const std::vector<std::string>& range)
{
    if (range.empty())
        return {};
    // needs #include <numeric> for std::accumulate
    return std::accumulate(std::next(range.begin(), 1), range.end(), range.front(),
                           [](auto const& best, const auto& sample)
                           {
                               return common_substring(best, sample);
                           });
}
Elegance aside, this is probably the fastest way since it avoids any memory allocations, performing all transformations in-place. For most architectures and sample sizes, this will matter more than any other performance consideration.
#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>
#include <string>
void reduce_to_common(std::string& best, const std::string& sample)
{
    best.erase(std::mismatch(best.begin(), best.end(),
                             sample.begin(), sample.end()).first,
               best.end());
}
void remove_common_prefix(std::vector<std::string>& range)
{
    if (range.size())
    {
        auto iter = range.begin();
        auto best = *iter;
        for ( ; ++iter != range.end() ; )
            reduce_to_common(best, *iter);
        auto prefix_length = best.size();
        for (auto& s : range)
            s.erase(s.begin(), std::next(s.begin(), prefix_length));
    }
}
int main()
{
    std::vector<std::string> samples = {
        // ...
    };
    remove_common_prefix(samples);
    for (auto& s : samples)
        std::cout << s << std::endl;
}
You have to search every string in the list. However you don't need to compare all the characters in every string. The common prefix can only get shorter, so you only need to compare with "the common prefix so far". I don't think this changes the big-O complexity - but it will make quite a difference to the actual speed.
Also, these look like file names. Are they sorted (bearing in mind that many filesystems tend to return things in sorted order)? If so, you only need to consider the first and last elements. If they are probably or mostly ordered, then consider the common prefix of the first and last, and then iterate through all the other strings, shortening the prefix further as necessary.
You just have to iterate over every string. You can only avoid iterating over the full length of strings needlessly by exploiting the fact, that the prefix can only shorten:
#include <iostream>
#include <string>
#include <vector>
std::string common_prefix(const std::vector<std::string> &ss) {
    if (ss.empty())
        // no prefix
        return "";
    std::string prefix = ss[0];
    for (size_t i = 1; i < ss.size(); i++) {
        size_t c = 0; // index after which the strings differ
        for (; c < prefix.length() && c < ss[i].length(); c++) {
            if (prefix[c] != ss[i][c]) {
                // strings differ from character c on
                break;
            }
        }
        if (c == 0)
            // no common prefix
            return "";
        // the prefix is only up to character c-1, so resize prefix
        prefix.resize(c);
    }
    return prefix;
}
void strip_common_prefix(std::vector<std::string> &ss) {
    std::string prefix = common_prefix(ss);
    if (prefix.empty())
        // no common prefix, nothing to do
        return;
    // drop the common part, which are always the first prefix.length() characters
    for (std::string &s: ss) {
        s = s.substr(prefix.length());
    }
}
int main()
{
    std::vector<std::string> ss { "/home/user/foo.txt", "/home/user/bar.txt", "/home/baz.txt"};
    strip_common_prefix(ss);
    for (std::string &s: ss)
        std::cout << s << "\n";
}
Drawing from the hints of Martin Bonner's answer, you may implement a more efficient algorithm if you have more prior knowledge on your input.
In particular, if you know your input is sorted, it suffices to compare the first and last strings (see Richard's answer).
i - Find the file which has the least folder depth (i.e. baz.txt) - its root path is home
ii - Then go through the other strings to see if they start with that root.
iii - If so then remove root from all the strings.
Start with std::size_t index=0;. Scan the list to see if characters at that index match (note: past the end does not match). If it does, advance index and repeat.
When done, index will have the value of the length of the prefix.
At this point, I'd advise you to write or find a string_view type. If you do, simply create a string_view for each of your strings str with start/end of index, str.size().
Overall cost: O(|prefix|*N+N), which is also the cost to confirm that your answer is correct.
If you don't want to write a string_view, simply call str.erase(str.begin(), str.begin()+index) on each str in your vector.
Overall cost is O(|total string length|+N). The prefix has to be visited in order to confirm it, then the tail of the string has to be rewritten.
Now the cost of the breadth-first is locality, as you are touching memory all over the place. It will probably be more efficient in practice to do it in chunks, where you scan the first K strings up to length Q and find the common prefix, then chain that common prefix plus the next block. This won't change the O-notation, but will improve locality of memory reference.
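A rough sketch of the column-by-column scan with views, assuming C++17's std::string_view is available; common_prefix_length and strip_common_prefix_views are names invented for this illustration.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Length of the prefix shared by all strings, found column by column.
std::size_t common_prefix_length(const std::vector<std::string>& strs)
{
    if (strs.empty())
        return 0;
    std::size_t index = 0;
    for (;; ++index) {
        for (const auto& s : strs) {
            // Running past the end of any string counts as a mismatch.
            if (index >= s.size() || s[index] != strs.front()[index])
                return index;
        }
    }
}

// Non-owning views of each string with the common prefix skipped.
// The views are only valid while `strs` is alive and unmodified.
std::vector<std::string_view> strip_common_prefix_views(const std::vector<std::string>& strs)
{
    const std::size_t n = common_prefix_length(strs);
    std::vector<std::string_view> out;
    out.reserve(strs.size());
    for (const auto& s : strs)
        out.push_back(std::string_view(s).substr(n));
    return out;
}
```

Nothing is copied or rewritten here, which is where the O(|prefix|*N+N) figure comes from; the price is that the views dangle as soon as the original strings change.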
for(vector<string>::iterator itr=V.begin(); itr!=V.end(); ++itr)

Correct Iteration with deletion in List syntax

I'm currently writing a program that uses lists. At one point in the program I want to iterate through three lists a, b and c, and delete any element in b and c if it appears in a. I'm doing it like this:
//remove elements from OpenList that are in ClosedList
for(list<Node> :: iterator cloIt = ClosedList.begin(); cloIt != ClosedList.end(); cloIt++)
{
    for(list<Node> :: iterator opIt = OpenList.begin(); opIt != OpenList.end(); opIt++)
    {
        for(list<Node> :: iterator neigIt = Neighbour.begin(); neigIt != Neighbour.end(); neigIt++)
        {
            if (*cloIt == *opIt)
                opIt = OpenList.erase(opIt);
            if (*cloIt == *neigIt)
                neigIt = Neighbour.erase(neigIt);
        }
    }
}
However, this is causing me to get a "List iterator not incrementable" error.
How could i fix this?
From your erase calls, you want to
remove OpenList items if they are found in ClosedList
remove Neighbour items if they are found in ClosedList
You'd better separate the code into two loops instead of nesting them, for example:
1. remove OpenList items if they are found in ClosedList
for(auto cloIt = ClosedList.begin(); cloIt != ClosedList.end(); ++cloIt)
{
    OpenList.remove_if([&](const Node& n){ return n == *cloIt; } );
}
2. remove Neighbour items if they are found in ClosedList
for(auto cloIt = ClosedList.begin(); cloIt != ClosedList.end(); ++cloIt)
{
    Neighbour.remove_if([&](const Node& n){ return n == *cloIt; } );
}
Obviously the previous code is duplicated, so you could write a common function for it:
void RemoveItem(std::list<Node>& node_list, std::list<Node>& node_list2)
{
    for(auto cloIt = node_list2.begin(); cloIt != node_list2.end(); ++cloIt)
    {
        node_list.remove_if([&](const Node& n){ return n == *cloIt; } );
    }
}
Now you could call:
RemoveItem(OpenList, ClosedList);
RemoveItem(Neighbour, ClosedList);
Don't forget to define operator== for the Node type, for example if Node has a getId interface:
bool operator==(const Node& lhs, const Node& rhs)
{
    return lhs.getId() == rhs.getId();
}
How could i fix this?
The best way is to use standard algorithms and let them do the iteration, search, and/or the conditional removal for you.
You could use std::list's remove_if() member function with a lambda predicate that checks if the element is contained in list a:
#include <algorithm>
// ...
b.remove_if(
    [&a] (Node const& n)
    {
        return (std::find(begin(a), end(a), n) != a.end());
    });
Same for removing elements from c if they are contained in a.
Another possibility is to use std::for_each() to iterate over all elements of a and remove them from b and c:
#include <algorithm>
// ...
std::for_each(begin(a), end(a),
    [&b, &c] (Node const& n)
    {
        b.remove(n);
        c.remove(n);
    });
You've correctly used the return value of .erase to obtain the new iterator, but forgot that this iterator gets ++'d immediately at the end of the current iteration of your loop; if the result of .erase was .end, then this is an invalid operation.
(You're actually very fortunate that you get a diagnostic for attempting to increment your now-invalid iterators — the standard guarantees absolutely nothing about this case.)
You need to ++ only when you didn't .erase.
The general pattern looks like this:
for (typename list<T>::iterator it = l.begin(), end = l.end(); it != end; )
// ^^ NB. no "it++" in the loop introduction!
{
    if (foo(*it)) {
        // condition satisfied; do the erase, and get the next
        // iterator from `.erase` and NOT through incrementing
        it = l.erase(it);
    }
    else {
        // no erasure; do the increment only in this case
        ++it;
    }
}
You could avoid the problem altogether by using standard algorithms, as Andy suggests.
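Applied to the "remove from one list whatever appears in another" task, the erase pattern might look like this. It is a sketch only: it uses std::list<int> instead of the question's Node type to stay self-contained, and erase_common is a made-up name.

```cpp
#include <algorithm>
#include <list>

// Erases from `target` every element that also occurs in `reference`.
void erase_common(std::list<int>& target, const std::list<int>& reference)
{
    for (auto it = target.begin(); it != target.end(); ) {
        if (std::find(reference.begin(), reference.end(), *it) != reference.end())
            it = target.erase(it); // erase returns the iterator after the removed element
        else
            ++it;                  // advance only when nothing was erased
    }
}
```

Note that end() is re-read on every iteration here; for std::list that is cheap and sidesteps any question about a cached end iterator.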

Looping on C++ iterators starting with second (or nth) item

I am looking for a readable, elegant way to do the following in C++, here shown in Python:
for datum in data[1:]:
# do work.
The containers in question may not provide random access iterators, so I can't just use:
for (mIter = data.begin() + 1; mIter != data.end(); mIter++)
The best I've come up with is the following:
iterable::iterator mIter = data.begin();
for (mIter++; mIter != data.end(); mIter++) {
    // do work.
}
It's not too lengthy, but it's hardly expository - at first glance it actually looks like a mistake!
Another solution is to have an "nth element" helper function, I guess. Is there a more concise way?
You can use std::next(iter, n) for a linear-time advance. You can also use the standard std::advance algorithm, though it isn't as simple to use (it takes the iterator by a non-const reference and doesn't return it).
For example,
for (mIter = std::next(data.begin()); mIter != data.end(); ++mIter)
or
mIter = data.begin();
std::advance(mIter, 1);
for (; mIter != data.end(); ++mIter)
Note that you must make sure that data.size() >= 1, otherwise the code will fail in a catastrophic manner.
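A minimal guarded version, assuming a std::list<int> and a made-up function name sum_skipping_first, just to show where the emptiness check goes:

```cpp
#include <iterator>
#include <list>

// Sums every element except the first; an empty list yields 0.
int sum_skipping_first(const std::list<int>& data)
{
    int total = 0;
    if (data.empty())
        return total; // std::next(data.begin()) would step past end() here
    for (auto it = std::next(data.begin()); it != data.end(); ++it)
        total += *it;
    return total;
}
```

std::next works with any forward iterator, so the same loop header is fine for list, set, vector and friends.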
#include <iterator>
iterator iter = data.begin();
for (advance(iter, 1); iter != data.end(); ++iter)
// do work
This relies on data containing at least one element, though: advancing an iterator past end() is undefined behaviour (not necessarily an exception).
You could try:
for (mIter = data.begin() ; ++mIter != data.end() ; )
but you'd need to make sure that if data.begin () == data.end () doing the ++mIter doesn't cause a problem.
Since this is a non-standard for loop, using a while loop might be more appropriate as there are fewer preconceived ideas about how they work, i.e. people looking at your code are more likely to read a while statement than a for statement as there is usually a model of how a for loop should work in their head.
mIter = data.begin ();
while (++mIter != data.end ())
You can use boost::next for this (but you should be sure that the list actually has an element in it before doing so):
#include <algorithm>
#include <iostream>
#include <iterator>
#include <list>
#include <boost/assign.hpp>
#include <boost/next_prior.hpp>
using namespace boost::assign;
int main()
{
    std::list<int> lst = list_of(23)(9)(84)(24)(12)(18);
    std::copy(boost::next(lst.begin()), lst.end(), std::ostream_iterator<int>(std::cout, " "));
    return 0;
}
iterable::iterator mIter = data.begin();
std::for_each(++mIter, data.end(), some_func);
where some_func contains the code you want to execute... you could even trivialise it with a simple wrapper function
template <typename _cont, typename _func>
void for_1_to_end(_cont const& container, _func func)
{
    typename _cont::const_iterator it = container.begin();
    std::for_each(++it, container.end(), func);
}
This is how I would do it:
// starting position in the list
int i = 4;
// initialize "it" to point to the first item of data_list.
std::list<int>::iterator it = data_list.begin();
if (i < data_list.size()) {
    // loop starting from 4 to end of the list.
    for (std::advance(it, i); it != data_list.end(); it++) {
        //use "it" here
    }
}
else {
    // Error: starting point is greater than size of data_list
}
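What might be a good solution in a modern C++ way:
std::for_each(std::next(cbegin(data)),cend(data),[&](const auto& elem)
{
    //do whatever you want with elem here
});
This needs a non-empty data (stepping past cend() of an empty container is undefined behaviour), so guard it with a check on data.empty(). Using std::next instead of +1 keeps it working for containers without random access iterators. It's basically possible to use this in the exact same way as you would a range-based for loop, and it has the advantage of not requiring any additional variable while keeping the code readable.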
