Searching For Elements With Multiple Matches - c++

I have a vector of Key-Value pairs, where each Key-Value pair is also tagged with an Entry Type code. The possible Entry Type codes are:
enum Type
{
tData = 0,
tSeqBegin = 1, // the beginning of a sequence
tSeqEnd = 2 // the end of a sequence
};
So the Key-Value pair itself looks like this:
struct KeyVal
{
int key_;
string val_;
Type type_;
};
Within the vector are sub-arrays of additional Key-Value pairs. These sub-arrays are called 'sequences'. Sequences can be nested to any level. So sequences can themselves have (optional) sub-sequences of varying lengths. The combination of a Key and Type is unique within a sequence element. That is, within a single sequence element there can only be one 269 data row, but other sequence elements can have their own 269 data rows.
Here is a graphical representation of some sample data, grossly oversimplified (If the 'Type' column is blank, it is of type tData):
Row#  Type       Key   Value
----  ---------  ----  -------
1                35    "W"
2                1181  "IBM"
3     tSeqBegin  268   "3"
4                269   "0"
5                270   "160.3"
6     tSeqEnd    0
7                269   "0"
8                290   "0"
9     tSeqBegin  453   "1"      <-- subsequence
10    tSeqEnd    0              <-- end of subsequence
11    tSeqEnd    0
12               269   "0"
13               290   "1"
14               270   "160.4"
15    tSeqEnd    0
16               1759  "ABC"
[EDIT: A note on the above. There is one tSeqBegin that marks the beginning of the whole sequence. The end of each sequence element is marked by a tSeqEnd. But there is no special tSeqEnd that also marks the end of the whole sequence. So for a sequence you will see 1 tSeqBegin and n tSeqEnds, where n is the number of elements within the sequence.
Another note, in the above sequence beginning at row #3 and ending at row #15, there is one subsequence in the 2nd element (rows 7-11). The subsequence is empty, and occupies rows 9 and 10.]
What I'm trying to do is find a sequence element which has multiple Key-Value matches to certain criteria. For example, suppose I want to find the sequence element that has both 269="0" and 290="0". In this case, it should not find element #0 (starting at row 3) because that element doesn't have a 290=... row at all. It should find the element starting at row #7 instead. Ultimately I will extract other fields from this element, but that's beyond the scope of this problem, so I haven't included that data above.
I can't use std::find_if() because find_if() will evaluate each row individually, not the whole sequence element as a unit. So I can't construct a functor that evaluates something like if 269=="0" && 290=="0" because no single row will ever evaluate this to true.
I had thought to implement my own find_sequence_element(...) function. But this would involve some fairly complex logic. First I would have to identify the begin() and end() of the entire sequence, noting where each element begins and ends. Then I would have to construct some kind of evaluation structure that I could string together like this pseudocode:
Condition cond = KeyValueMatch(269, "0") + KeyValueMatch(290, "0");
But this is also complex. I can't just construct a find_sequence_element() that takes exactly 2 parameters, one for the 269 match and another for the 290 match, because I want to use this algorithm for other sequences as well, with more or fewer conditions.
Moreover, it seems like I should be able to use the STL <algorithm>'s that already exist. While I know the STL rather well, I can't figure out a way to use find_if() in any straightforward way.
So, finally, here's the question. If you were faced with the above problem, how would you solve it? I know the question is vague. I'm hoping that with some discussion we can narrow the problem domain down until we have an answer.
Some conditions:
I cannot change the single flat vector to a vector of vectors or anything of the like. The reasons for this are complex.
(Placeholder for more conditions :) )
(If consensus is that this should be CW, I will mark it as such)

I would want to process in an online fashion. Have a type which tracks:
where the current sequence started
a count of how many requirements have been met so far by the current sequence.
In your example requirements could be represented as a map<int,string>. In general they could be a sequence of binary predicates, or something polymorphic if you need to use different functors for different conditions in the same set, and for efficiency progress could be represented as a sequence of booleans, "has this predicate been met yet?"
When you see a tSeqEnd you clear the set of met requirements and start again. If your count hits the number of requirements, you're done.
The simplest case is that all predicates specify the key value, and hence only match once. It might look something like:
template<typename DataIterator, typename PredIterator>
DataIterator find_matching_sequence(
    DataIterator dataFirst,
    DataIterator dataLast,
    PredIterator predFirst,
    PredIterator predLast) {
    DataIterator sequence_start = dataFirst;
    size_t required = std::distance(predFirst, predLast);
    size_t sofar = 0;
    while (dataFirst != dataLast) {
        if (dataFirst->type_ == tSeqEnd) {
            // end of the current element: reset progress, start the next one
            sofar = 0;
            ++dataFirst;
            sequence_start = dataFirst;
            continue;
        }
        // Matches(row) stands for a functor that applies a predicate to row
        sofar += std::count_if(predFirst, predLast, Matches(*dataFirst));
        if (sofar == required) return sequence_start;
        ++dataFirst;
    }
    return dataLast; // no element met every requirement
}
If the same predicate could match multiple rows in a subsequence, then you can use a vector<bool> instead of a count, or possibly a valarray<bool>.
To cope with multiply-nested sub-sequences, you actually need a stack of "how am I doing" records, and you might be able to implement that by the function recursively calling itself, and returning early if it sees enough "end" records to know that it has reached the end of its outermost sequence. But I don't really understand that part of the data format.
So no serious use of STL algorithms, unless you want to std::copy your initial range into an output iterator that performs the online processing ;-)
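To make the online approach concrete, here is a minimal sketch that uses the map<int,string> form of the requirements. The function name find_matching_element and the choice to return an index are my own assumptions, not part of the question. It also assumes elements are flat: any tSeqBegin or tSeqEnd row resets the progress, so an element whose matching rows are split around a nested subsequence would be missed.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

enum Type { tData = 0, tSeqBegin = 1, tSeqEnd = 2 };

struct KeyVal {
    int key_;
    std::string val_;
    Type type_;
};

// Hypothetical helper: returns the index of the first row of the first
// element whose data rows satisfy every (key, value) requirement, or
// rows.size() if no element does. Nested subsequences are NOT handled:
// every marker row simply starts a new element.
std::size_t find_matching_element(const std::vector<KeyVal>& rows,
                                  const std::map<int, std::string>& required) {
    std::size_t start = 0;
    std::set<int> met;  // keys of the requirements met by the current element
    for (std::size_t i = 0; i < rows.size(); ++i) {
        if (rows[i].type_ != tData) {
            met.clear();    // marker row: forget progress
            start = i + 1;  // the next element begins on the following row
            continue;
        }
        auto it = required.find(rows[i].key_);
        if (it != required.end() && it->second == rows[i].val_) {
            met.insert(it->first);
        }
        if (met.size() == required.size()) return start;
    }
    return rows.size();
}
```

Using a set of met keys instead of a bare counter means a requirement that matches twice in one element is only counted once.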

Hoping I understand your setup correctly, I would proceed as a two-step fashion, nesting search algorithms along the lines of:
template<typename It, typename Pr>
It find_sequence_element ( It begin, It end, Pr predicate );
except that Pr here is a predicate that takes a sequence and returns if that sequence matches, yes or no. An example for a single match could be:
class HasPair
{
    int key_; string value_;
public:
    HasPair ( int key, string value ) : key_(key), value_(value) {}
    template<typename It>
    bool operator() ( It begin, It end ) const {
        // true when some row in [begin,end) carries the (key_,value_) pair
        return std::find_if(begin, end, item_predicate(key_, value_)) != end;
    }
};
Where item_predicate() is suitable to find the (key_,value_) pair in [begin,end).
If you're interested in finding a sequence with two pairs, write a HasPairs predicate that invokes std::find_if twice, or some more optimized version of a search for two elements.
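One way this two-step idea could be fleshed out is sketched below. HasAllPairs is a name I made up for the "several pairs" predicate, and the element-splitting rule is an assumption: the range passed in is taken to start just after the tSeqBegin row, and every tSeqEnd closes the current element, so an element containing a nested subsequence is cut short at the subsequence's tSeqEnd.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

enum Type { tData = 0, tSeqBegin = 1, tSeqEnd = 2 };
struct KeyVal { int key_; std::string val_; Type type_; };

// Range predicate: true when [first, last) holds a data row with this pair.
struct HasPair {
    int key;
    std::string value;
    template <typename It>
    bool operator()(It first, It last) const {
        return std::any_of(first, last, [this](const KeyVal& kv) {
            return kv.type_ == tData && kv.key_ == key && kv.val_ == value;
        });
    }
};

// Hypothetical combinator: true when every listed pair occurs in the range.
struct HasAllPairs {
    std::vector<HasPair> pairs;
    template <typename It>
    bool operator()(It first, It last) const {
        return std::all_of(pairs.begin(), pairs.end(),
                           [&](const HasPair& p) { return p(first, last); });
    }
};

// Splits [first, last) into elements at each tSeqEnd and returns the start
// of the first element accepted by pred, or last if there is none.
template <typename It, typename Pr>
It find_sequence_element(It first, It last, Pr pred) {
    It elem = first;
    for (It it = first; it != last; ++it) {
        if (it->type_ == tSeqEnd) {
            if (pred(elem, it)) return elem;
            elem = std::next(it);
        }
    }
    return last;
}
```

Because the conditions live in a vector, the same find_sequence_element call works for one, two, or more key-value requirements.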

Related

Removing strings from a string vector, from a substring

I am implementing the unit clause propagation algorithm into c++. I have read in the CNF file to a vector with each clause in individual element of a vector, so for example
1 2 0
1 2 3 0
1 0
3 4 0
So far I am able to isolate individual elements and set them as a string, so in this example I would set the string to be "1".
The next step would be to remove all elements in the vector which contain a 1, so in this example the 1st, 2nd and 3rd elements would be removed. However, when I run the vector remove command
clauses.erase(std::remove(clauses.begin(), clauses.end(), "1"), clauses.end());
it will only remove elements which are exactly "1", not the elements which contain a 1 as well as other characters. Is there any way to remove any element of a vector which contains the string?
(I hope this makes sense, thank you for your time)
Use std::remove_if and search for a 1 in the string:
clauses.erase(
std::remove_if(clauses.begin(), clauses.end(),
[](const std::string &s) {return s.find('1') != std::string::npos;}
),
clauses.end()
);
If you don't have C++11 for the lambda, a normal function, or functor, or Boost lambda, or whatever floats your boat, will work as well.
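For the functor route, a minimal sketch could look like this. Contains and remove_clauses_containing are names chosen here for illustration only.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Predicate object: true when the stored text occurs anywhere in the clause.
struct Contains {
    std::string needle;
    explicit Contains(const std::string& n) : needle(n) {}
    bool operator()(const std::string& clause) const {
        return clause.find(needle) != std::string::npos;
    }
};

// Erase-remove idiom with the functor in place of a lambda.
void remove_clauses_containing(std::vector<std::string>& clauses,
                               const std::string& literal) {
    clauses.erase(
        std::remove_if(clauses.begin(), clauses.end(), Contains(literal)),
        clauses.end());
}
```

Note that this is a plain substring test, so searching for "1" would also match clauses containing "11" or "-1". If that matters, the clause would need to be split into tokens and compared literal by literal.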

C++: Time complexity of using STL's sort in order to sort a 2d array of integers on different columns

let's say we have the following 2d array of integers:
1 3 3 1
1 0 2 2
2 0 3 1
1 1 1 0
2 1 1 3
I was trying to create an implementation where the user could give as input the array itself and a string. An example of a string in the above example would be 03 which would mean that the user wants to sort the array based on the first and the fourth column.
So in this case the result of the sorting would be the following:
1 1 1 0
1 3 3 1
1 0 2 2
2 0 3 1
2 1 1 3
I didn't know a lot about the compare functions that are being used inside the STL's sort function, however after searching I created the following simple implementation:
I created a class called Comparator.h
class Comparator{
private:
std::string attr;
public:
Comparator(std::string attr) { this->attr = attr; }
bool operator()(const int* first, const int* second){
std::vector<int> left;
std::vector<int> right;
size_t i;
for(i=0;i<attr.size();i++){
left.push_back(first[attr.at(i) - '0']);
right.push_back(second[attr.at(i) - '0']);
}
for(i=0;i<left.size();i++){
if(left[i] < right[i]) return true;
else if(left[i] > right[i]) return false;
}
return false;
}
};
I need to know the information inside the string so I need to have a class where this string is a private variable. Inside the operator I would have two parameters first and second, each of which will refer to a row. Now having this information I create a left and a right vector where in the left vector I have only the numbers of the first row that are important to the sorting and are specified by the string variable and in the right vector I have only the numbers of the second row that are important to the sorting and are specified by the string variable.
Then I do the needed comparisons and return true or false. The user can use this class by calling this function inside the Sorting.cpp class:
void Sorting::applySort(int **data, std::string attr, int amountOfRows){
std::sort(data, data+amountOfRows, Comparator(attr));
}
Here is an example use:
int main(void){
//create a data[][] variable and fill it with integers
Sorting sort;
sort.applySort(data, "03", number_of_rows);
}
I have two questions:
First question
Can my implementation get better? I use extra variables like the left and right vectors, and then I have some for loops which brings some extra costing to the sorting operation.
Second question
Due to the extra cost, how much worse does the time complexity of the sorting become? I know that STL's sort is O(n*logn) where n is the number of integers that you want to sort. Here n has a different meaning, n is the number of rows and each row can have up to m integers which in turn can be found inside the Comparator class by overriding the operator function and using extra variables(the vectors) and for loops.
Because I'm not sure how exactly is STL's sort implemented I can only make some estimates.
My initial estimate would be O(n*m*log(n)) where m is the number of columns that are important to the sorting however I'm not 100% certain about it.
Thank you in advance
You can certainly improve your comparator. There's no need to copy the columns and then compare them. Instead of the two push_back calls, just compare the values and either return true, return false, or continue the loop according to whether they're less, greater, or equal.
The relevant part of the complexity of sort is O(n * log n) comparisons (in C++11. C++03 doesn't give quite such a good guarantee), where n is the number of elements being sorted. So provided your comparator is O(m), your estimate is OK to sort the n rows. Since attr.size() <= m, you're right.
First question: you don't need left and right. You add elements one by one and then iterate over the vectors in the same order. So instead of pushing values to vectors and then iterating over them, simply use the values as you generate them in the first loop, like so:
for(i=0;i<attr.size();i++){
int left = first[attr.at(i) - '0'];
int right = second[attr.at(i) - '0'];
if(left < right) return true;
else if(left > right) return false;
}
Second question: can the time complexity be improved? Not with a sorting algorithm that uses direct comparison. On the other hand, the problem you solve here is somewhat similar to radix sort, so I believe you should be able to do the sorting in O(n*m), where m is the number of sorting criteria, provided the values in each column fall in a small, bounded range.
1) Firstly to start off you should convert the string into an integer array in the constructor. With validation of values being less than the number of columns.
(You could also have another constructor that takes an integer array as a parameter.
A slight enhancement is to allow negative values to indicate that the order of the sort is reversed for that column. In this case the values would be -N..-1 , 1..N)
2) There is no need for the intermediate left, right arrays.
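A rough sketch of points 1) and 2) together might look like the following. The class name ColumnComparator and the extra columnCount constructor parameter are assumptions made for the example. The negative-index idea for reversed order is left out, since single digits in a string cannot express it.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

class ColumnComparator {
    std::vector<std::size_t> cols_;  // column indices, parsed once
public:
    // Parses attr (one digit per column) and validates every index
    // against columnCount up front, instead of on each comparison.
    ColumnComparator(const std::string& attr, std::size_t columnCount) {
        for (char c : attr) {
            if (c < '0' || c > '9' ||
                static_cast<std::size_t>(c - '0') >= columnCount) {
                throw std::invalid_argument("bad column index in attr");
            }
            cols_.push_back(static_cast<std::size_t>(c - '0'));
        }
    }

    // Compares the selected columns in order, with no temporary vectors.
    bool operator()(const int* first, const int* second) const {
        for (std::size_t col : cols_) {
            if (first[col] != second[col]) return first[col] < second[col];
        }
        return false;  // equal on every selected column
    }
};
```

It plugs into the existing call as std::sort(data, data + amountOfRows, ColumnComparator(attr, columnCount)), where columnCount is whatever the caller knows the row width to be.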

Algorithm: A Better Way To Calculate Frequencies of a list of words

This question is actually quite simple, yet I would like to hear some ideas before jumping into coding. Given a file with a word in each line, calculate the n most frequent words.
The first and unfortunately only thing that pops up in my mind is to use a std::map. I know fellow C++'ers will say that unordered_map would be much more reasonable.
I would like to know if anything could be added on the algorithm side, or whether this is basically a 'whoever picks the best data structure wins' type of question. I've searched it over the internet and read that hash table and a priority queue might provide an algorithm with O(n) running time, however I assume it will be too complex to implement.
Any ideas?
The best data structure to use for this task is a Trie:
http://en.wikipedia.org/wiki/Trie
It will outperform a hash table for counting strings.
There are many different approaches to this question. The right one depends on the scenario and on other factors such as the size of the file: if the file has a billion lines, then a HashMap would not be an efficient way to do it. Here are some things you can do depending on your problem:
If you know that the number of unique words is very limited, you can use a TreeMap or, in your case, std::map.
If the number of words is very large, then you can build a trie and keep the counts of the relevant words in another data structure. This could be a heap (min/max depends on what you want to do) of size n. So you don't need to store all the words, just the necessary ones.
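The "count, then keep a heap of size n" part can be sketched as below. To keep the example short it counts with std::unordered_map rather than a trie, which is a substitution on my part, and the function name top_n is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Returns up to n (word, count) pairs, most frequent first.
std::vector<std::pair<std::string, std::size_t>>
top_n(const std::vector<std::string>& words, std::size_t n) {
    std::unordered_map<std::string, std::size_t> counts;
    for (const auto& w : words) ++counts[w];

    // Min-heap ordered by (count, word): the top is the weakest candidate.
    using Entry = std::pair<std::size_t, std::string>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (const auto& kv : counts) {
        heap.emplace(kv.second, kv.first);
        if (heap.size() > n) heap.pop();  // never hold more than n entries
    }

    std::vector<std::pair<std::string, std::size_t>> result;
    while (!heap.empty()) {
        result.emplace_back(heap.top().second, heap.top().first);
        heap.pop();
    }
    std::reverse(result.begin(), result.end());  // heap pops ascending
    return result;
}
```

With W input words and U distinct words, counting is O(W) on average and the heap pass is O(U log n), so the heap never grows beyond n entries even when U is large.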
I would not start with std::map (or unordered_map) if I had much choice (though I don't know what other constraints may apply).
You have two data items here, and you use one as the key part of the time, but the other as the key another part of the time. For that, you probably want something like a Boost Bimap or possibly Boost MultiIndex.
Here's the general idea using Bimap:
#include <boost/bimap.hpp>
#include <boost/bimap/list_of.hpp>
#include <iostream>
#define elements(array) ((sizeof(array)/sizeof(array[0])))
class uint_proxy {
unsigned value;
public:
uint_proxy() : value(0) {}
uint_proxy& operator++() { ++value; return *this; }
unsigned operator++(int) { return value++; }
operator unsigned() const { return value; }
};
int main() {
int b[]={2,4,3,5,2,6,6,3,6,4};
boost::bimap<int, boost::bimaps::list_of<uint_proxy> > a;
// walk through array, counting how often each number occurs:
for (int i=0; i<elements(b); i++)
++a.left[b[i]];
// print out the most frequent:
std::cout << a.right.rbegin()->second;
}
For the moment, I've only printed out the most frequent number, but iterating N times to print out the N most frequent is pretty trivial.
If you are just interested in the top N most frequent words, and you don't need it to be exact, then there is a very clever structure you can use. I heard of this by way of Udi Manber, it works as follows:
You create an array of N elements, each element tracks a value and a count, you also keep a counter that indexes into this array. Additionally, you have a map from value to index into that array.
Every time you update your structure with a value (like a word from a stream of text) you first check your map to see if that value is already in your array, if it is you increment the count for that value. If it is not then you decrement the count of whatever element your counter is pointing at and then increment the counter.
This sounds simple, and nothing about the algorithm makes it seem like it will yield anything useful, but for typical real data it tends to do very well. Normally if you wish to track the top N things you might want to make this structure with the capacity of 10*N, since there will be a lot of empty values in it. Using the King James Bible as input, here is what this structure lists as the most frequent words (in no particular order):
0 : in
1 : And
2 : shall
3 : of
4 : that
5 : to
6 : he
7 : and
8 : the
9 : I
And here are the top ten most frequent words (in order):
0 : the , 62600
1 : and , 37820
2 : of , 34513
3 : to , 13497
4 : And , 12703
5 : in , 12216
6 : that , 11699
7 : he , 9447
8 : shall , 9335
9 : unto , 8912
You see that it got 9 of the top 10 words correct, and it did so using space for only 50 elements. Depending on your use case the savings on space here may be very useful. It is also very fast.
Here is the implementation of topN that I used, written in Go:
type Event string
type TopN struct {
events []Event
counts []int
current int
mapped map[Event]int
}
func makeTopN(N int) *TopN {
return &TopN{
counts: make([]int, N),
events: make([]Event, N),
current: 0,
mapped: make(map[Event]int, N),
}
}
func (t *TopN) RegisterEvent(e Event) {
if index, ok := t.mapped[e]; ok {
t.counts[index]++
} else {
if t.counts[t.current] == 0 {
t.counts[t.current] = 1
t.events[t.current] = e
t.mapped[e] = t.current
} else {
t.counts[t.current]--
if t.counts[t.current] == 0 {
delete(t.mapped, t.events[t.current])
}
}
}
t.current = (t.current + 1) % len(t.counts)
}
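Since the rest of this thread is in C++, here is my attempt at a direct port of the Go code above. It is a sketch only: the count() accessor is an addition of mine for inspecting the structure and is not part of the Go version.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

class TopN {
public:
    explicit TopN(std::size_t n) : events_(n), counts_(n, 0) {}

    void register_event(const std::string& e) {
        if (counts_.empty()) return;  // capacity 0: nothing can be tracked
        auto it = mapped_.find(e);
        if (it != mapped_.end()) {
            ++counts_[it->second];             // already tracked: bump it
        } else if (counts_[current_] == 0) {
            counts_[current_] = 1;             // free slot: claim it
            events_[current_] = e;
            mapped_[e] = current_;
        } else if (--counts_[current_] == 0) {
            mapped_.erase(events_[current_]);  // occupant decayed to zero
        }
        current_ = (current_ + 1) % counts_.size();
    }

    // Added for illustration: tracked count for e, or 0 if e has no slot.
    int count(const std::string& e) const {
        auto it = mapped_.find(e);
        return it == mapped_.end() ? 0 : counts_[it->second];
    }

private:
    std::vector<std::string> events_;
    std::vector<int> counts_;
    std::size_t current_ = 0;
    std::unordered_map<std::string, std::size_t> mapped_;
};
```

As in the Go version, the counts are approximate: an untracked word arriving at an occupied slot only weakens the occupant, so the stored numbers are not true frequencies.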
Given a file with a word in each line, calculate the n most frequent words.
...
I've searched it over the internet and read that hash table and a priority queue might provide an algorithm with O(n)
If you meant that the two n's are the same, then no, this is not possible. However, if you just meant time linear in terms of the size of the input file, then a trivial implementation with a hash table will do what you want.
There might be probabilistic approximate algorithms with sublinear memory.

Word Frequency Statistics

In a pre-interview, I was faced with a question like this:
Given a string consists of words separated by a single white space, print out the words in descending order sorted by the number of times they appear in the string.
For example an input string of “a b b” would generate the following output:
b : 2
a : 1
Firstly, I'd say it is not so clear that whether the input string is made up of single-letter words or multiple-letter words. If the former is the case, it could be simple.
Here is my thought:
int c[26] = {0};
char *pIn = strIn;
while (*pIn != 0)
{
    if (*pIn != ' ')
        ++c[*pIn - 'a']; /* assumes lowercase single-letter words */
    ++pIn;
}
/* how to sort the array c[26] and remember the original index? */
I can get the statistics of the frequency of every single-letter word in the input string, and I can get it sorted (using QuickSort or whatever). But after the count array is sorted, how do I get the single-letter word associated with each count so that I can print them out in pairs later?
If the input string is made up of multiple-letter words, I plan to use a map<const char *, int> to track the frequency. But again, how do I sort the map's key-value pairs?
The question is in C or C++, and any suggestion is welcome.
Thanks!
I would use a std::map<std::string, int> to store the words and their counts. Then I would use something like this to get the words:
while(std::cin >> word) {
// increment map's count for that word
}
Finally, you just need to figure out how to print them in order of frequency. I'll leave that as an exercise for you.
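For anyone who wants a hint on that last step, one possible approach (my own sketch, with a made-up function name words_by_frequency) is to copy the map's entries into a vector and sort that by count:

```cpp
#include <algorithm>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Returns (word, count) pairs ordered by descending count.
std::vector<std::pair<std::string, int>>
words_by_frequency(const std::string& text) {
    std::map<std::string, int> counts;
    std::istringstream in(text);
    for (std::string word; in >> word;) ++counts[word];

    // A map cannot be re-sorted by value, so copy its entries out first.
    std::vector<std::pair<std::string, int>> ranked(counts.begin(),
                                                    counts.end());
    // stable_sort keeps the map's alphabetical order among equal counts.
    std::stable_sort(ranked.begin(), ranked.end(),
                     [](const std::pair<std::string, int>& a,
                        const std::pair<std::string, int>& b) {
                         return a.second > b.second;
                     });
    return ranked;
}
```

Printing is then a loop over the returned vector, writing each word followed by its count.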
You're definitely wrong in assuming that you need only 26 options, 'cause your employer will want to allow multiple-character words as well (and maybe even numbers?).
This means you're going to need an array with a variable length. I strongly recommend using a vector or, even better, a map.
To find the character sequences in the string, find your current position (start at 0) and the position of the next space. Then that's the word. Set the current position to the space and do it again. Keep repeating this until you're at the end.
By using the map you'll already have the word/count available.
If the job you're applying for requires university skills, I strongly recommend optimizing the map by adding some kind of hashing function. However, judging by the difficulty of the question I assume that that is not the case.
Taking the C-language case:
I like brute-force, straightforward algos so I would do it in this way:
1. Tokenize the input string to give an unsorted array of words. I'll have to actually, physically move each word (because each is of variable length); and I think I'll need an array of char*, which I'll use as the arg to qsort( ).
2. qsort( ) (descending) that array of words. (In the COMPAR function of qsort(), pretend that bigger words are smaller words so that the array acquires descending sort order.)
3.a. Go through the now-sorted array, looking for subarrays of identical words. The end of a subarray, and the beginning of the next, is signalled by the first non-identical word I see.
3.b. When I get to the end of a subarray (or to the end of the sorted array), I know (1) the word and (2) the number of identical words in the subarray.
4. EDIT new step 4: Save, in another array (call it array2), a char* to a word in the subarray and the count of identical words in the subarray.
5. When there are no more words in the sorted array, I'm done. It's time to print.
6. qsort( ) array2 by word frequency.
7. Go through array2, printing each word and its frequency.
I'M DONE! Let's go to lunch.
All the answers prior to mine did not give really an answer.
Let us think on a potential solution.
There is a more or less standard approach for counting something in a container.
We can use an associative container like a std::map or a std::unordered_map. Here we associate a "key", in this case the word, with a value, in this case the count of that specific word.
And luckily the maps have a very nice index operator[]. This will look for the given key and, if found, return a reference to the value. If not found, then it will create a new entry with the key and return a reference to the new entry. So, in both cases, we will get a reference to the value used for counting. And then we can simply write:
std::unordered_map<std::string, int> counter{};
counter[word]++;
And that looks really intuitive.
After this operation, you have already the frequency table. Either sorted by the key (the word), by using a std::map or unsorted, but faster accessible with a std::unordered_map.
Now you want to sort according to the frequency/count. Unfortunately this is not possible with maps.
Therefore we need to use a second container, like a std::vector, which we can then sort using std::sort with any given predicate, or we can copy the values into a container like a std::multiset, which implicitly orders its elements.
For getting out the words of a std::string we simply use a std::istringstream and the standard extraction operator >>. No big deal at all.
And because writing all these long names for the std containers is tedious, we create alias names with the using keyword.
After all this, we now write ultra compact code and fulfill the task with just a few lines of code:
#include <iostream>
#include <string>
#include <sstream>
#include <utility>
#include <set>
#include <unordered_map>
#include <type_traits>
#include <iomanip>
// ------------------------------------------------------------
// Create aliases. Save typing work and make code more readable
using Pair = std::pair<std::string, unsigned int>;
// Standard approach for counter
using Counter = std::unordered_map<Pair::first_type, Pair::second_type>;
// Sorted values will be stored in a multiset
struct Comp { bool operator ()(const Pair& p1, const Pair& p2) const { return (p1.second == p2.second) ? p1.first<p2.first : p1.second>p2.second; } };
using Rank = std::multiset<Pair, Comp>;
// ------------------------------------------------------------
std::istringstream text{ " 4444 55555 1 22 4444 333 55555 333 333 4444 4444 55555 55555 55555 22 "};
int main() {
Counter counter;
// Count
for (std::string word{}; text >> word; counter[word]++);
// Sort
Rank rank(counter.begin(), counter.end());
// Output
for (const auto& [word, count] : rank) std::cout << std::setw(15) << word << " : " << count << '\n';
}

How to get the next prefix in C++?

Given a sequence (for example a string "Xa"), I want to get the next prefix in lexicographic order (i.e. "Xb"). The next of "aZ" should be "b".
A motivating use case where this function is useful is described here.
As I don't want to reinvent the wheel, I'm wondering if there is any function in C++ STL or boost that can help to define this generic function easily?
If not, do you think that this function can be useful?
Notes
Even if the examples are strings, the function should work for any Sequence.
The lexicographic order should be a template parameter of the function.
From the answers I conclude that there is nothing in C++/Boost that can help to define this generic function easily, and also that this function is too specific to be proposed for free. I will implement a generic next_prefix and after that I will ask whether you find it useful.
I have accepted the single answer that gives some hints on how to do that even if the proposed implementation is not generic.
I'm not sure I understand the semantics by which you wish the string to transform, but maybe something like the following can be a starting point for you. The code will increment the sequence, as if it was a sequence of digits representing a number.
template<typename Bi, typename I>
bool increment(Bi first, Bi last, I minval, I maxval)
{
if( last == first ) return false;
while( --last != first && *last == maxval ) *last = minval;
if( last == first && *last == maxval ) {
*last = minval;
return false;
}
++*last;
return true;
}
Maybe you wish to add an overload with a function object, or an overload or specialization for primitives. A couple of examples:
string s1("aaz");
increment(s1.begin(), s1.end(), 'a', 'z');
cout << s1 << endl; // aba
string s2("95");
do {
cout << s2 << ' '; // 95 96 97 98 99
} while( increment(s2.begin(), s2.end(), '0', '9') );
cout << endl;
That seems so specific that I can't see how it would get into the STL or Boost.
When you say the order is a template parameter, what are you envisaging will be passed? A comparator that takes two characters and returns bool?
If so, then that's a bit of a nightmare, because the only way to find "the least char greater than my current char" is to sort all the chars, find your current char in the result, and step forward one (or actually, if some chars might compare equal, use upper_bound with your current char to find the first greater char).
In practice, for any sane string collation you can define a "get the next char, or warn me if I gave you the last char" function more efficiently, and build your "get the next prefix" function on top of that. Hopefully, permitting an arbitrary order is more flexibility than you need.
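The sort-then-upper_bound step described above can be sketched as follows. The function name next_char and the idea of passing the whole character set as a vector are assumptions for the example.

```cpp
#include <algorithm>
#include <functional>
#include <optional>
#include <vector>

// Returns the least character in chars that is strictly greater than
// current under comp, or nothing if current is already maximal.
template <typename Compare>
std::optional<char> next_char(std::vector<char> chars, char current,
                              Compare comp) {
    std::sort(chars.begin(), chars.end(), comp);
    // upper_bound skips every character that compares equal to current.
    auto it = std::upper_bound(chars.begin(), chars.end(), current, comp);
    if (it == chars.end()) return std::nullopt;
    return *it;
}
```

Sorting on every call is only for illustration. In practice the sorted table would be built once per ordering and reused.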
Orderings are typically specified as a comparator, not as a sequence generator.
Lexicographical orderings in particular tend to be only partial, for example under case or diacritic insensitivity. Therefore your final product will be nondeterministic, or at best arbitrary. ("Always choose lowest numerical encoding"?)
In any case, if you accept a comparator as input, the only way to translate that to an increment operation would be to compare the current value against every other in the character space. Which could work, 127 values being so few (a comparator-sorted table would make short work of the problem), or could be impossibly slow, if you use any other kind of character.
The best way is likely to define the character ordering somehow, then define the rules from going from one character to two characters to three characters.
Use whatever sort function you wish to use over the complete list of characters that you want to include, then just use that as the ordering. Find the index of the current character, and you can easily find the previous and next characters. Only advance the right-most character, unless it's going to roll over, then advance the next character to the left.
In other words, reinventing the wheel is like 10 lines of Python. Probably less than 500 lines of C++. :)
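For what it's worth, a C++ version for strings can also be short. The sketch below assumes the ordering is supplied as a string listing the allowed characters from lowest to highest, and it follows the question's example ("aZ" becomes "b") by dropping trailing maximal characters instead of rolling them over. Both the name next_prefix and that interpretation are my own reading.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Returns the next prefix after s under the given alphabet order, or
// nothing if s is empty, consists only of the last character, or ends
// (after trimming) in a character that is not in the alphabet.
std::optional<std::string> next_prefix(std::string s,
                                       const std::string& alphabet) {
    if (alphabet.empty()) return std::nullopt;
    // Trailing maximal characters cannot be advanced, so drop them.
    while (!s.empty() && s.back() == alphabet.back()) s.pop_back();
    if (s.empty()) return std::nullopt;
    std::size_t pos = alphabet.find(s.back());
    if (pos == std::string::npos) return std::nullopt;
    // s.back() differs from alphabet.back(), so pos + 1 is in range.
    s.back() = alphabet[pos + 1];
    return s;
}
```

With an alphabet of "a" to "z" followed by "A" to "Z", "Xa" maps to "Xb" and "aZ" maps to "b", matching the two examples in the question.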