Comparing two vectors of maps - c++

I've got two ways of fetching a bunch of data. The data is stored in a sorted vector<map<string, int> >.
I want to identify whether there are inconsistencies between the two vectors.
What I'm currently doing (pseudo-code):
for i in 0... min(length(vector1), length(vector2)):
    for (k, v) in vector1[i]:
        if v != vector2[i][k]:
            // report that k is bad for index i,
            // with vector1 having v, vector2 having vector2[i][k]
for i in 0... min(length(vector1), length(vector2)):
    for (k, v) in vector2[i]:
        if v != vector1[i][k]:
            // report that k is bad for index i,
            // with vector2 having v, vector1 having vector1[i][k]
This works in general, but breaks horribly if vector1 has a, b, c, d and vector2 has a, b, b1, c, d (it reports brokenness for b1, c, and d). I'm after an algorithm that tells me that there's an extra entry in vector2 compared to vector1.
I think I want to do something like this: when I encounter mismatched entries, I look ahead at the following entries in the second vector. If a match is found before the end of the second vector, I store the index i of the matching entry and move on to the next entry in the first vector, resuming the search at vector2[i+1].
Is there a neater way of doing this? Some standard algorithm that I've not come across?
I'm working in C++, so C++ solutions are welcome, but solutions in any language or pseudo-code would also be great.
Example
Given the arbitrary map objects: a, b, c, d, e, f and g;
With vector1: a, b, d, e, f
and vector2: a, c, e, f
I want an algorithm that tells me either:
Extra b at index 1 of vector1, and vector2's c != vector1's d.
or (I'd view this as an effectively equivalent outcome)
vector1's b != vector2's c and extra d at index 2 of vector1
Edit
I ended up using std::set_difference, and then doing some matching on the diffs from both sets to work out which entries were similar but different, and which had entries completely absent from the other vector.

Something like the std::mismatch algorithm
You could also use std::set_difference
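Since the question's edit says std::set_difference ended up being used, here is a minimal sketch of that step. It assumes both vectors are sorted with std::map's built-in operator< (a lexicographical comparison of the key/value pairs); the names Record and onlyIn are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <map>
#include <string>
#include <vector>

using Record = std::map<std::string, int>;  // hypothetical alias for the element type

// Returns the records that appear in `lhs` but not in `rhs`.
// Precondition (assumed, as in the question): both vectors are sorted
// by Record's default operator<.
std::vector<Record> onlyIn(const std::vector<Record>& lhs,
                           const std::vector<Record>& rhs) {
    assert(std::is_sorted(lhs.begin(), lhs.end()));
    assert(std::is_sorted(rhs.begin(), rhs.end()));
    std::vector<Record> out;
    std::set_difference(lhs.begin(), lhs.end(), rhs.begin(), rhs.end(),
                        std::back_inserter(out));
    return out;
}
```

Calling onlyIn(vector1, vector2) and onlyIn(vector2, vector1) gives the two diff sets; deciding which leftover entries are "similar but different" is a separate matching step on those two results.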

It sounds like you're looking for the diff algorithm. The idea is to identify the longest common subsequence of the two vectors (using map equality), then recurse down the non-common portions. Eventually you'll have an alternating list of vector sub-sequences that are identical, and sub-sequences that have no common elements. You can then easily produce whatever output you like from this.
Apply it to the two vectors, and there you go.
Note that since map comparison is expensive, if you can hash the maps (use a strong hash - collisions will result in incorrect output) and use the hashes for comparisons you'll save a lot of time.
Once you're down to the mismatched subsequences at the end, you'll have something like:
Input vectors: a b c d e f, a b c' d e f
Output:
COMMON a b
LEFT c
RIGHT c'
COMMON d e f
You can then individually compare the maps c and c' to figure out how they differ.
If you have a mutation and insertion next to each other, it gets more complex:
Input vectors: a b V W d e f, a b X Y d e f
Output:
COMMON a b
LEFT V W
RIGHT X Y
COMMON d e f
Determining whether to match V and W against X or Y (or not at all) is something you'll have to come up with a heuristic for.
Of course, if you don't care about how the content of the maps differ, then you can stop here, and you have the output you want.
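A rough sketch of how the COMMON / LEFT / RIGHT output could be produced. Note that it uses the textbook O(n*m) dynamic-programming table for the longest common subsequence rather than the recursive scheme described above, and that Tag, Edit and diff are illustrative names, not part of any library. It only needs operator== on the element type, so it works for std::map (or for hashes of the maps).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

enum class Tag { Common, Left, Right };

struct Edit {
    Tag tag;
    std::size_t index;  // index into `left` for Common/Left, into `right` for Right
};

template <typename T>
std::vector<Edit> diff(const std::vector<T>& left, const std::vector<T>& right) {
    const std::size_t n = left.size(), m = right.size();
    // lcs[i][j] = length of the longest common subsequence of left[i..] and right[j..]
    std::vector<std::vector<std::size_t>> lcs(n + 1, std::vector<std::size_t>(m + 1, 0));
    for (std::size_t i = n; i-- > 0;)
        for (std::size_t j = m; j-- > 0;)
            lcs[i][j] = (left[i] == right[j])
                            ? lcs[i + 1][j + 1] + 1
                            : std::max(lcs[i + 1][j], lcs[i][j + 1]);

    // Walk the table from the front, emitting one edit per element.
    std::vector<Edit> out;
    std::size_t i = 0, j = 0;
    while (i < n && j < m) {
        if (left[i] == right[j]) {
            out.push_back({Tag::Common, i});
            ++i;
            ++j;
        } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
            out.push_back({Tag::Left, i});   // element only in the left vector
            ++i;
        } else {
            out.push_back({Tag::Right, j});  // element only in the right vector
            ++j;
        }
    }
    for (; i < n; ++i) out.push_back({Tag::Left, i});
    for (; j < m; ++j) out.push_back({Tag::Right, j});
    // Every element of both inputs is covered exactly once.
    assert(out.size() >= std::max(n, m));
    return out;
}
```

The table costs O(n*m) memory and comparisons, which is why hashing the maps first pays off for large inputs. Adjacent Left/Right runs are the places where a heuristic would have to decide which maps to compare key by key.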

What exactly are you trying to achieve? Could you please define precisely what output you expect in terms of the input? Your pseudo code compares maps at the vector index. If that is not the correct semantics, then what is?

Can you associate some kind of checksum (or Bloom filter) with each map, so that a single check tells you whether a full comparison makes sense?

In your example, note that it is not possible to differentiate between
Extra b at index 1 of vector1, and vector2's c != vector1's d.
and
Extra b at index 1 of vector1, extra d at index 2 of vector1, and extra c at index 1 of vector2
because it is not clear that "c" should be compared to "d"; it could just as well be compared to "b". I assume the vectors are not sorted in any way that helps here: std::map does provide relational operators (a lexicographical comparison of its key/value pairs), but that ordering says nothing about which maps are "close" to each other. That the maps themselves are sorted is, as far as I can see, completely irrelevant ;-)
So your example is slightly misleading. It could even be
Compare
b f e a d
with
a c f e
You can check each element of the first vector against each element of the second vector.
This has quadratic runtime.
for i in 0... length(vector1):
    foundmatch = false;
    for j in 0... length(vector2):
        mismatch = false;
        for (k, v) in vector1[i]:
            if v != vector2[j][k]:
                mismatch = true;
                break; // no need to compare against the remaining keys.
        if (!mismatch) // found matching element j in vector2 for element i in vector1
            foundmatch = true;
            break; // no need to compare against the remaining elements in vector2
    if (foundmatch)
        continue;
    else
        // report that vector1[i] has no matching element in vector2[]
        // "extra b at i"
If you want to find the missing elements, just swap vector1 and vector2.
If you want to check whether an element in vector2 differs from an element in vector1 in only a single key, you have to add additional code around "no need to compare against the remaining keys".

Related

C++20 comparing two lazily sorted ranges

The question
I have two ranges, call them v,w that are sorted in a given fashion and can be compared (call the order relation T). I want to compare them lexicographically but after sorting them in a different way (call this other order relation S). For this I do not really need the ranges to be completely sorted: I only need to lazily evaluate the elements on the sorted vectors until I find a difference. For example if the maximum of v in this new order is larger than the maximum of w, then I need to only look once in the ordered vectors. In the worst case that v == w I'd look up in all elements.
I understand that C++20 std::ranges::views allows me to get a read-only view of v and w that is lazily evaluated. Is it possible to get a custom sorted view that is still lazily evaluated? If I were able to define something like this pseudocode
auto v_view_sorted_S = v | std::views::lazily_sort();
auto w_view_sorted_S = w | std::views::lazily_sort();
Then I could simply call std::ranges::lexicographical_compare(v_view_sorted_S, w_view_sorted_S).
How does one implement this?
Would simply calling std::ranges::sort(std::views::all(v)) work, in the sense that it accepts a view instead of an actual range and, more importantly, evaluates the view lazily? I gather from the comments on the answer to this question that under some conditions std::ranges::sort can be applied to views, even transformed ones. But I suspect that it sorts them at call time; is that the case?
The case I want it used:
I am interested in any example, but the particular use case I have is the following. It is irrelevant for the question, but helps put this in context.
The structures v and w are of the form
std::array<std::vector<unsigned int>,N> v;
Where N is a compile-time constant. Moreover, for each 0 <= i < N, v[i] is guaranteed to be non-increasing. The lexicographical order thus obtained for any two ordered arrays is what I called T above.
What I am interested in is comparing them by the following rule: given entries a = v[i][j] and b = v[k][l] with 0 <= i,k < N and j,l >= 0, declare a > b if that relation holds as unsigned integers, or if a == b as unsigned integers and i < k.
After ordering all entries of v and w with respect to this order, then I want to compare them lexicographically.
Example, if v = {{2,1,1}, {}, {3,1}}, w = {{2,1,0}, {2}, {3,0}} and z = {{2,1,0}, {3}, {2,0}}, then z > w > v.
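One way the lazy evaluation described above might be sketched, without ranges views: copy the entries into a scratch buffer, build a max-heap under the order S in O(n) with std::make_heap, and pop the largest remaining entry from each side only until the two differ. This is a heap-based stand-in for the hypothetical lazily_sort view, not a real standard facility; Entry, lessS, flatten and compareLazily are made-up names, and treating a proper prefix as the smaller sequence is an assumption.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Entry = std::pair<unsigned, std::size_t>;  // {value, row index i}

// Order S from the question: a is smaller than b if its value is smaller,
// or if the values tie and its row index is larger.
inline bool lessS(const Entry& a, const Entry& b) {
    if (a.first != b.first) return a.first < b.first;
    return a.second > b.second;
}

template <std::size_t N>
std::vector<Entry> flatten(const std::array<std::vector<unsigned>, N>& v) {
    std::vector<Entry> out;
    for (std::size_t i = 0; i < N; ++i) {
        // The question guarantees non-increasing rows; the heap below does not rely on it.
        assert(std::is_sorted(v[i].rbegin(), v[i].rend()));
        for (unsigned x : v[i]) out.emplace_back(x, i);
    }
    return out;
}

// Returns -1, 0 or 1 for v < w, v == w, v > w under "sort by S, then compare".
template <std::size_t N>
int compareLazily(const std::array<std::vector<unsigned>, N>& v,
                  const std::array<std::vector<unsigned>, N>& w) {
    std::vector<Entry> hv = flatten(v), hw = flatten(w);
    std::make_heap(hv.begin(), hv.end(), lessS);  // largest entry under S at front()
    std::make_heap(hw.begin(), hw.end(), lessS);
    auto ev = hv.end();
    auto ew = hw.end();
    while (ev != hv.begin() && ew != hw.begin()) {
        if (lessS(hv.front(), hw.front())) return -1;
        if (lessS(hw.front(), hv.front())) return 1;
        std::pop_heap(hv.begin(), ev, lessS);  // O(log n) per inspected element
        --ev;
        std::pop_heap(hw.begin(), ew, lessS);
        --ew;
    }
    if (ev != hv.begin()) return 1;   // w ran out first
    if (ew != hw.begin()) return -1;  // v ran out first
    return 0;
}
```

If the largest entries already differ, only the two heap constructions and one comparison are paid for; a full sort happens only in the worst case where the sequences are equal.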

What's the logic behind the order the elements are passed to a comparison function in std::sort?

I'm practicing lambdas:
#include <algorithm>
#include <vector>

int main()
{
    std::vector<int> v {1,2,3,4};
    int count = 0;
    std::sort(v.begin(), v.end(), [](const int& a, const int& b) -> bool
    {
        return a > b;
    });
}
This is just code from GeeksForGeeks to sort in descending order, nothing special. I added some print statements (but took them out for this post) to see what was going on inside the lambda. They print the entire vector, and the a and b values:
1 2 3 4
a=2 b=1
2 1 3 4
a=3 b=2
3 2 1 4
a=4 b=3
4 3 2 1 <- final
So my more detailed question is:
What's the logic behind the order the vector elements are being passed into the a and b parameters?
Is b permanently at index 0 while a is iterating? And if so, isn't it a bit odd that the second param passed to the lambda stays at the first element? Is it compiler-specific? Thanks!
By passing a predicate to std::sort(), you are specifying your sorting criterion. The predicate must return true if the first parameter (i.e., a) precedes the second one (i.e., b), for the sorting criterion you are specifying.
Therefore, for your predicate:
return a > b;
If a is greater than b, then a will precede b.
So my more detailed question is: What's the logic behind the order the vector elements are being passed into the a and b parameters?
a and b are just pairs of elements from the range you are passing to std::sort(). The "logic" depends on the underlying algorithm that std::sort() implements. The pairs may also differ between calls with identical input if the implementation uses randomization.
Is 'b' permanently at index 0 while 'a' is iterating? And if so, isn't it a bit odd that the second param passed to the lambda stays at the first element?
No, because the first element is the highest.
It seems that, with this algorithm, every element is checked against (and maybe swapped with) the highest one in the first round, and the highest one is placed in the first position; so b always refers to the highest one.
For Visual Studio, std::sort uses insertion sort if the sub-array size is <= 32 elements. For a larger sub-array, it uses intro sort, which is quick sort unless the "recursion" depth gets too deep, in which case it switches to heap sort. The output your program produces appears to correspond to some variation of insertion sort. Since the compare function is "less than", and since insertion sort is looking for out-of-order pairs where the left value is "greater than" the right value, the input parameters are swapped.
You just compare two elements, with a given ordering. This means that if the order is a and then b, then the lambda must return true.
Whether a or b is the first or the last element of the array, or stays fixed, depends on the sorting algorithm and of course on your data!
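To see which pairs a given implementation actually passes, the comparator can record its arguments. A minimal sketch (sortDescendingLogged is a made-up name); the recorded sequence is implementation-specific, so no particular output is claimed here.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Sorts v in descending order and returns every (a, b) pair the
// comparator was called with, in call order.
std::vector<std::pair<int, int>> sortDescendingLogged(std::vector<int>& v) {
    std::vector<std::pair<int, int>> calls;
    std::sort(v.begin(), v.end(), [&calls](int a, int b) {
        calls.emplace_back(a, b);  // capture by reference: copies of the lambda share the log
        return a > b;              // true when a must come before b
    });
    assert(std::is_sorted(v.rbegin(), v.rend()));  // descending order reached
    return calls;
}
```

Printing the returned pairs for the same input on different standard libraries is a quick way to confirm that the call order is an implementation detail rather than something the comparator may rely on.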

Having an iterator as a sliding window of 3 elements that can overshoot bounds (possibly using Boost)

Having read this SO post and exploring Boost.Iterator, I want to see if I can make a sliding window of size 3 iterate through a single vector where the final iteration has an 'empty third element'.
Assuming that the vector size is >= 2, an example:
{a, b, c, d, e, f, g}
We will always start on index 1 because this algorithm I'm implementing requires a 'previous' element to be present and does not need to operate on the first element (so we would iterate from i = 1 while i < size()):
    V
[a, b, c]
{a, b, c, d, e, f, g}
when I move to the next iteration, it would look like:
       V
   [b, c, d]
{a, b, c, d, e, f, g}
and upon reaching the last element in the iteration, it would have this:
                   V
               [f, g, EMPTY]
{a, b, c, d, e, f, g}
What I want is to be able to grab the "prev" element, check "hasNext", and grab the next element if available. My goal is very clean modern C++ code, and not having to do the bookkeeping of tracking pointers/references for three different elements makes the code a lot cleaner:
for (const auto& it : zippedIterator(dataVector)) {
    someFunc(it.first, it.second);
    if (someCondition(it.second) && hasThirdElement) {
        anotherFunc(it.second, it.third);
    }
}
I was trying to see if this is possible with boost's zip iterator, but I don't know if it allows me to overshoot the end and have some empty value.
I've thought of doing some hacky stuff like having a dummy final element, but then I have to document it and I'm trying to write clean code with zero hacky tricks.
I was also going to roll my own iterator but apparently std::iterator is deprecated.
I also don't want to create a copy of the underlying vector since this will be used in a tight loop that needs to be fast and copying everything would be very expensive for the underlying objects. It doesn't need to be extremely optimized, but copying the iterator values into a new array is out of the question.
If this were a matter of simply having a sized window into a range, then what you really want is to have a range that you can advance. In your case, that range is 3 elements long, but there's no reason that a general mechanism couldn't allow for a variable-sized range. It would just be a pair of iterators, such that you can ++ or -- both of them at the same time.
The problem you run into is that you want to manufacture an element if the subrange is off the end of the range. That complicates things; that would require proxy-iterators and so forth.
If you want a solution for your specific case (a 3-element sized range, where the last element can be manufactured if it's off the end of the main range), then you first need to decide if you want to have an actual type for this. That is, is it worth implementing a whole type, rather than a couple of one-off utility functions?
My way to handle this would be to redefine the problem. What you seem to have is a current element, just like any other iteration. But you want to be able to access the previous element. And you want to be able to peek ahead to the next element; if there is none, then you want to manufacture some default. So... perform iteration, but write a couple of utility functions that let you access what you need from the current element.
for(auto curr = std::next(dataVector.begin());
    curr != dataVector.end();
    ++curr)
{
    someFunc(prevElement(curr), *curr);
    auto nextIt = curr + 1;
    if(nextIt != dataVector.end() && someCondition(*curr))
        anotherFunc(*curr, *nextIt);
}
prevElement is a simple function that accesses the element before the given iterator.
template<typename It>
//requires BidirectionalIterator<It>
decltype(auto) prevElement(It curr) {return *(--curr);}
If you want to have a function to check the next element and manufacture a value for it, that can be done too. This one has to return a prvalue of the element, since we may have to manufacture it:
template<typename It>
//requires ForwardIterator<It>
auto checkNextElement(It curr, It endIt)
{
    ++curr;
    if(curr == endIt)
        return typename std::iterator_traits<It>::value_type{};
    return *curr;
}
Yes, this isn't all that clever, with special range types and the like. But what you're doing is hardly common, particularly having to manufacture the next element as you do. By keeping things simple and obvious, you make it easy for someone to read your code without having to understand some specialized sub-range type.
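As a variation on the helpers above, the "empty third element" could be represented by a null pointer instead of a manufactured value, which avoids copying elements altogether. A minimal sketch under that assumption; forEachWindow and the callback signature are hypothetical names, not a Boost or standard facility.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Calls visit(prev, curr, next) for every element from index 1 onwards.
// `next` is a pointer to the following element, or nullptr on the last step.
template <typename T, typename F>
void forEachWindow(const std::vector<T>& data, F&& visit) {
    assert(data.size() >= 2);  // the question assumes at least two elements
    for (std::size_t i = 1; i < data.size(); ++i) {
        const T* next = (i + 1 < data.size()) ? &data[i + 1] : nullptr;
        visit(data[i - 1], data[i], next);
    }
}
```

Inside the callback, `if (next && someCondition(curr)) anotherFunc(curr, *next);` plays the role of the hasThirdElement check from the question, and nothing is copied because all three arguments refer into the original vector.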

STL pair comparison - first elements

Can someone explain the meaning of this paragraph:
The great advantage of pairs is that they have built-in operations to compare themselves. Pairs are compared first-to-second element. If the first elements are not equal, the result will be based on the comparison of the first elements only; the second elements will be compared only if the first ones are equal. The array (or vector) of pairs can easily be sorted by STL internal functions.
and hence this
For example, if you want to sort the array of integer points so that they form a polygon, it’s a good idea to put them to the vector< pair<double, pair<int,int> > >, where each element of vector is { polar angle, { x, y } }. One call to the STL sorting function will give you the desired order of points.
I have been struggling for an hour to understand this.
Source
Consider looking at operator< for pair<A,B>, which is a class that looks something like:
struct pairAB {
    A a;
    B b;
};
You could translate that paragraph directly into code:
bool operator<(const pairAB& lhs, const pairAB& rhs) {
    if (lhs.a != rhs.a) {        // If the first elements are not equal
        return lhs.a < rhs.a;    // the result will be based on
    }                            // the comparison of the first elements only
    return lhs.b < rhs.b;        // the second elements will be compared
                                 // only if the first ones are equal.
}
Or, thinking more abstractly, this is how lexicographic sort works. Think of how you would order two words. You'd compare their first letters - if they're different, you can stop and see which one is less. If they're the same, then you go onto the second letter.
The first paragraph says that pairs have an ordering as follows: if you have (x, y) and (z, w), and you compare them, then it will first check if x is smaller (or larger) than z: if yes, then the first pair is smaller (or larger) than the second. If x = z, however, then it will compare y and w. This makes it very convenient to do things like sorting a vector of pairs when the first elements of the pairs are more important to the order than the second elements.
The second paragraph gives an interesting application. Suppose you stand at some point on a plane, and there's a polygon enclosing you. Then each point will have an angle and a distance. But given the points, how do you know in what order should they be to form a polygon (without crisscrossing themselves)? If you store the points in this format (angle, distance), then you'll get the circling direction for free. That's actually rather neat.
The STL pair is a container to hold two objects together. Consider this for example,
pair<T1, T2> a, b;
The first element can be accessed via a.first and the second via a.second.
The first paragraph is telling us that the STL provides built-in operations to compare two pairs. For example, if you need to compare 'a' and 'b', the comparison is first done using a.first and b.first. If both values are the same, the comparison is done using a.second and b.second. Since this is built-in functionality, you can easily use it with STL algorithms such as sort, binary_search, etc.
The second paragraph is an example of how this might be used. Consider a situation where you would want to sort the points in a polygon. You would first want to sort them based on their polar angle, then the x co-ordinate and then the y co-ordinate. Thus we make use of the pair {angle, {x,y}}. So any comparison would be first done on the angle, then advanced to the x value and then the y value.
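The { angle, { x, y } } idea might look like the sketch below. It assumes the polar angle is measured around the origin with std::atan2 (the quoted text does not say which reference point is meant), and Point, AnglePoint and sortByPolarAngle are names invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

using Point = std::pair<int, int>;            // {x, y}
using AnglePoint = std::pair<double, Point>;  // {polar angle, {x, y}}

// Returns the points ordered by polar angle around the origin
// (assumed reference point), ties broken by x and then y.
std::vector<Point> sortByPolarAngle(const std::vector<Point>& points) {
    std::vector<AnglePoint> keyed;
    keyed.reserve(points.size());
    for (const Point& p : points)
        keyed.emplace_back(std::atan2(static_cast<double>(p.second),
                                      static_cast<double>(p.first)),
                           p);
    // One call is enough: pair's operator< compares the angle first,
    // then falls through to the nested pair {x, y}.
    std::sort(keyed.begin(), keyed.end());
    std::vector<Point> out;
    out.reserve(keyed.size());
    for (const AnglePoint& ap : keyed) out.push_back(ap.second);
    return out;
}
```

No custom comparator is written anywhere; the ordering comes entirely from the built-in comparison of the nested pairs.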
It is easier to understand with a simple example of pairs of last names and first names.
For example if you have pairs
{ Tomson, Ann }
{ Smith, Tony }
{ Smith, John }
and want to sort them in the ascending order you have to compare the pairs with each other.
If you compare the first two pairs
{ Tomson, Ann }
{ Smith, Tony }
then the last name of the first pair is greater than the last name of the second pair. So there is no need to also compare the first names. It is already clear that pair
{ Smith, Tony }
has to precede pair
{ Tomson, Ann }
On the other hand if you compare pairs
{ Smith, Tony }
{ Smith, John }
then the last names of the pairs are equal. So you need to compare the first names of the pairs. As John is less than Tony then it is clear that pair
{ Smith, John }
will precede pair
{ Smith, Tony }
though the last names (the first elements of the pairs) are equal.
As for the pair { polar angle, { x, y } }: if the polar angles of two different pairs are equal, then the { x, y } parts are compared, which are in turn pairs. So if the first elements (x) are equal, the y values are compared.
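The last-name/first-name example from this answer can be checked directly with std::sort. A small sketch, with Name and sortedNames as illustrative names:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

using Name = std::pair<std::string, std::string>;  // {last name, first name}

// Returns the names in ascending order: by last name, then by first name.
std::vector<Name> sortedNames(std::vector<Name> names) {
    std::sort(names.begin(), names.end());  // uses pair's built-in operator<
    return names;
}
```

For the three pairs above this yields { Smith, John }, { Smith, Tony }, { Tomson, Ann }: the first names only matter for the two Smiths.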
In practice, when you have a vector or array of pairs you don't have to write any comparison yourself: just call sort(v.begin(), v.end()). It sorts on the first element, and when the first elements are equal they are compared using the second element. See the code and output at https://ideone.com/Ad2yVG.

Total merging search optimization

Assume there is a vector VA of size N, and each element is another vector of type T. There is an operation on type T that returns a new value of type T, i.e., bool merge(T a, T b, T &ret);. If a and b can be merged, it stores the result in ret and returns true; otherwise, it returns false. The merge operation is reflective and transitive.
A solution is found if either:
∃ x0, x1, ..., xN-1. merge(VA[0][x0], VA[1][x1], merge(VA[2][x2], ..., merge(VA[N-2][xN-2],VA[N-1][xN-1], ret)...));
any elements from N-1 (not N) sub-vectors can be merged (pick any N-1 with exactly one exception).
For example:
VA is of size 3. Element a can be merged with Element b with the result c. Element c can be merged with Element d with the result e.
VA[0] = {a}
VA[1] = {b, q}
VA[2] = {d, r}
All solutions in the above example are: {a,b}, {a,d}, {b,d}, {a,b,d}.
The task is to find all solution in the given vector VA.
My C++ code is:
void findAll(unsigned int step, unsigned int size, const T pUnifier, int hole_id) {
    if(step == size) printOneResult(pUnifier);
    else {
        _path[step] = -1;
        findAll(step + 1, size, pUnifier, step);
    }
    std::vector<T> vec = VA[step];
    for(std::vector<T>::const_iterator it = vec.begin(); it < vec.end(); it++) {
        T nextUnifier; // was "T nextUnifier();", which declares a function
        if( merge( *it, pUnifier, nextUnifier )) {
            _path[lit_id] = it->getID();
            findAll(step + 1, size, nextUnifier, hole_id);
        }
    }
}
The code contains recursive calls; however, it is not tail recursive. It is running slowly in practice. In reality, the size of VA is possibly hundreds and each sub-vector size is of hundreds, too. I'm wondering whether it can be optimized.
Thank you very much.
If I'm understanding your code correctly, you're performing a (recursive) brute-force search. This is not efficient, since you're given some information about your search space.
I think a good candidate here would be the A* algorithm. You could use the current greatest-chain size as the heuristic, or perhaps even the sum of the squares of the chain sizes.
To improve your code: since you use vectors, you could index them with the [] operator and an int counter instead of iterators. Iterators can be noticeably slower in unoptimized or checked-iterator builds, although with optimizations enabled the difference is usually negligible.
You can improve it even more by minimising the function calls in either of your loops, for example by computing the values you will use beforehand.
Since you didn't explain what T_VEC really is, I couldn't write the complete iterator-free version, but this should already be a big plus for speed.