How to remove elements by substrings from an STL container - C++

I have a vector of objects (the objects are term nodes that, amongst other fields, contain a string field with the term string):
class TermNode {
private:
std::wstring term;
double weight;
...
public:
...
};
After some processing and calculating the scores, these objects are finally stored in a vector of TermNode pointers:
std::vector<TermNode *> termlist;
The resulting list in this vector, containing up to 400 entries, looks like this:
DEBUG: 'knowledge' term weight=13.5921
DEBUG: 'discovery' term weight=12.3437
DEBUG: 'applications' term weight=11.9476
DEBUG: 'process' term weight=11.4553
DEBUG: 'knowledge discovery' term weight=11.4509
DEBUG: 'information' term weight=10.952
DEBUG: 'techniques' term weight=10.4139
DEBUG: 'web' term weight=10.3733
...
What I am trying to do is clean up that final list: remove single terms that also appear inside phrases in the terms list. For example, looking at the above list snippet, there is the phrase 'knowledge discovery', so I would like to remove the single terms 'knowledge' and 'discovery', because they are also in the list and redundant in this context. I want to keep the phrases containing the single terms. I am also thinking about removing all strings of 3 characters or fewer, but that is just a thought for now.
For this cleanup process I would like to use remove_if / find_if (with the new C++ lambdas), and it would be nice to have that code in a compact class.
I am not really sure how to solve this. The problem is that I would first have to identify which strings to remove, probably by setting a flag as a delete marker. That would mean I would have to pre-process the list: find the single terms, then find the phrases that contain one of those single terms. I think that is not an easy task and would need some advanced algorithm. Using a suffix tree to identify substrings?
Another loop over the vector, and maybe a copy of the same vector, could do the cleanup. I am looking for the most time-efficient approach.
I have been playing with ideas along the lines of those shown in std::list erase incompatible iterator using remove_if / find_if, and the approach used in Erasing multiple objects from a std::vector?.
So the question basically is: is there a smart way to do this while avoiding multiple loops, and how could I identify the single terms for deletion? Maybe I am really missing something, but probably someone out there can give me a good hint.
Thanks for your thoughts!
Update
I implemented the removal of redundant single terms the way Scrubbins recommended as follows:
/**
* Functor gets the term of each TermNode object, checks whether the term
* string contains spaces (i.e. the term is a phrase), splits the phrase by
* spaces and finally stores the term tokens into a set. Only terms with a
* weight higher than 'skipAtWeight' are taken into account.
*/
struct findPhrasesAndSplitIntoTokens {
private:
set<wstring> tokens;
double skipAtWeight;
public:
findPhrasesAndSplitIntoTokens(const double skipAtWeight)
: skipAtWeight(skipAtWeight) {
}
/**
* Implements operator()
*/
void operator()(const TermNode * tn) {
// --- skip all terms with a weight lower than skipAtWeight
if (tn->getWeight() < skipAtWeight)
return;
// --- get term
wstring term = tn->getTerm();
// --- if the term contains a space, it is a phrase: tokenize it once
if (term.find(L' ') != wstring::npos) {
// --- simply tokenize term by whitespace and store the tokens into
// --- the tokens set
// --- TODO: check if this really is UTF-8 aware, esp. for
// --- strings containing umlauts, etc !!
wistringstream iss(term);
copy(istream_iterator<wstring,
wchar_t, std::char_traits<wchar_t> >(iss),
istream_iterator<wstring,
wchar_t, std::char_traits<wchar_t> >(),
inserter(tokens, tokens.begin()));
}
}
/**
* return set of extracted tokens
*/
set<wstring> getTokens() const {
return tokens;
}
};
/**
* Functor to find terms in tokens set
*/
class removeTermIfInPhraseTokensSet {
private:
set<wstring> tokens;
public:
removeTermIfInPhraseTokensSet(const set<wstring>& termTokens)
: tokens(termTokens) {
}
/**
* Implements operator()
*/
bool operator()(const TermNode * tn) const {
return tokens.find(tn->getTerm()) != tokens.end();
}
};
...
findPhrasesAndSplitIntoTokens objPhraseTokens(6.5);
objPhraseTokens = std::for_each(
termList.begin(), termList.end(), objPhraseTokens);
set<wstring> tokens = objPhraseTokens.getTokens();
wcout << "size of tokens set=" << tokens.size() << endl;
for_each(tokens.begin(), tokens.end(), printSingleToken());
// --- remove all extracted single tokens from the final terms list
// --- of similar search terms
removeTermIfInPhraseTokensSet removeTermIfFound(tokens);
termList.erase(
remove_if(
termList.begin(), termList.end(), removeTermIfFound),
termList.end()
);
for (vector<TermNode *>::const_iterator tl_iter = termList.begin();
tl_iter != termList.end(); tl_iter++) {
wcout << "DEBUG: '" << (*tl_iter)->getTerm() << "' term weight=" << (*tl_iter)->getNormalizedWeight() << endl;
if ((*tl_iter)->getNormalizedWeight() <= 6.5) break;
}
...
I couldn't use the C++11 lambda syntax, because my Ubuntu servers currently have g++ 4.4.1 installed. Anyway, it does the job for now.
The way to go now is to check the quality of the resulting weighted terms against other search result sets, see how I can improve that quality, and find a way to boost the more relevant terms in conjunction with the original query term. It might not be an easy task; I wish there were some "simple heuristics".
But that might be another new question once I have stepped a little further :-)
So thanks to all for this rich contribution of thoughts!

What you need to do is first iterate through the list and split up all the multi-word values into single words. If you're allowing Unicode, this means you will need something akin to ICU's BreakIterators; otherwise you can go with a simple punctuation/whitespace split. When each string is split into its constituent words, use a hash map to keep a list of all the current words. When you reach a multi-word value, you can then check whether its words have already been found. This should be the simplest way to identify duplicates.

I suggest you use the "erase-remove" idiom in this way:
struct YourConditionFunctor {
bool operator()(TermNode* term) {
if (/* you have to remove term */) {
delete term;
return true;
}
return false;
}
};
and then write:
termlist.erase(
remove_if(
termlist.begin(),
termlist.end(),
YourConditionFunctor()
),
termlist.end()
);


C++ - checking a string for all values in an array

I have some parsed text from the Vision API, and I'm filtering it using keywords, like so:
if (finalTextRaw.find("File") != finalTextRaw.npos)
{
LogMsg("Found Menubar");
}
E.g., if the keyword "File" is found anywhere within the string finalTextRaw, then the function is interrupted and a log message is printed.
This method is very reliable. But I've inefficiently just made a bunch of if-else-if statements in this fashion, and as I'm finding more words that need filtering, I'd rather be a little more efficient. Instead, I'm now getting a string from a config file, and then parsing that string into an array:
string filterWords = GetApp()->GetFilter();
std::replace(filterWords.begin(), filterWords.end(), ',', ' '); ///replace ',' with ' '
vector<string> array;
stringstream ss(filterWords);
string temp;
while (ss >> temp)
array.push_back(temp); ///create an array of filtered words
And I'd like to have just one if statement for checking that string against the array, instead of many of them for checking the string against each keyword I'm having to manually specify in the code. Something like this:
if (finalTextRaw.find(array) != finalTextRaw.npos)
{
LogMsg("Found filtered word");
}
Of course, that syntax doesn't work, and it's surely more complicated than that, but hopefully you get the idea: if any words from my array appear anywhere in that string, that string should be ignored and a log message printed instead.
Any ideas how I might fashion such a function? I'm guessing it's going to necessitate some kind of loop.
Borrowing from Thomas's answer, a range-based for loop offers a neat solution:
for (const auto &word : words)
{
if (finalTextRaw.find(word) != std::string::npos)
{
// word is found.
// do stuff here or call a function.
break; // stop the loop.
}
}
As pointed out by Thomas, the most efficient way is to split both texts into a list of words. Then use std::set_intersection to find occurrences in both lists. You can use std::vector as long as it is sorted. You end up with O(n*log(n)) (with n = max words), rather than O(n*m).
Split sentences to words:
#include <algorithm>
#include <iterator>
#include <sstream>
#include <string>
#include <string_view>
#include <vector>

auto split(std::string_view sentence) {
auto result = std::vector<std::string>{};
auto stream = std::istringstream{std::string{sentence}}; // copy: a string_view need not be null-terminated
std::copy(std::istream_iterator<std::string>(stream),
std::istream_iterator<std::string>(), std::back_inserter(result));
return result;
}
Find words existing in both lists. This only works for sorted lists (like sets or manually sorted vectors).
auto intersect(std::vector<std::string> a, std::vector<std::string> b) {
std::sort(a.begin(), a.end());
std::sort(b.begin(), b.end());
auto result = std::vector<std::string>{};
std::set_intersection(std::move_iterator{a.begin()},
std::move_iterator{a.end()},
b.cbegin(), b.cend(),
std::back_inserter(result));
return result;
}
Example of how to use.
int main() {
const auto result = intersect(split("hello my name is mister raw"),
split("this is the final raw text"));
for (const auto& word: result) {
// do something with word
}
}
Note that this makes sense when working with large or undefined number of words. If you know the limits, you might want to use easier solutions (provided by other answers).
You could use a fundamental, brute force, loop:
unsigned int quantity_words = array.size();
for (unsigned int i = 0; i < quantity_words; ++i)
{
std::string word = array[i];
if (finalTextRaw.find(word) != std::string::npos)
{
// word is found.
// do stuff here or call a function.
break; // stop the loop.
}
}
The above loop takes each word in the array and searches the finalTextRaw for the word.
There are better methods using some std algorithms. I'll leave that for other answers.
Edit 1: maps and association
The above code is bothering me because there are too many passes through the finalTextRaw string.
Here's another idea:
Create a std::set using the words in finalTextRaw.
For each word in your array, check for existence in the set.
This reduces the quantity of searches (it's like searching a tree).
You should also investigate creating a set of the words in array and finding the intersection between the two sets.
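For illustration, a minimal sketch of that set-based approach (assuming whitespace tokenization; note this matches whole words only, unlike find(), which also matches substrings):
#include <set>
#include <sstream>
#include <string>
#include <vector>

bool containsFilteredWord(const std::string& finalTextRaw,
                          const std::vector<std::string>& filterWords) {
    // Build the set of words in finalTextRaw once.
    std::set<std::string> textWords;
    std::istringstream iss(finalTextRaw);
    std::string w;
    while (iss >> w)
        textWords.insert(w);
    // One tree lookup per filter word, instead of one scan of the text each.
    for (const auto& word : filterWords)
        if (textWords.count(word) != 0)
            return true;
    return false;
}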

Is the returned string from a recursive function going out of scope?

To give as simple a background context as I can (which I don't think is strictly necessary for what I'm trying to figure out at the moment): I'm trying to implement a graph representation via an adjacency list, in my case an unordered map from a string key to a struct value that contains a Vertex object pointer (the object identified by the key) and a vector of its dependencies. The goal is to output a critical path via a sort of DAG resolution algorithm.
So when I need to output a critical path, I'm trying to use a recursive solution I implemented. Basically it checks for a base case (a job with no dependencies) and returns a printout of its id, start time and length. Otherwise, it finds the longest-running (in terms of time length) job in the dependency list and calls the function on that, until a job with no dependencies is found. There can be more than one critical path, and I don't have to print out all of them.
MY QUESTION: I'm debugging this at the moment, and it has no problem printing out a job's properties when it's a base case. If it has to recurse, though, the string always comes back as empty (""). Is the recursive call making my string go out of scope by the time it comes back to the caller? Here is the code structure for it. All of the functions below are public members of the same Graph class.
string recurseDeps(unordered_map<string, Dependencies>& umcopy, string key) {
if (umcopy[key].deps.empty()) {
string depPath = " ";
string idarg, starg, larg, deparg;
idarg = key;
starg = " " + to_string(umcopy[key].jobatKey->getStart());
larg = " " + to_string(umcopy[key].jobatKey->getStart() + umcopy[key].jobatKey->getLength());
umcopy.erase(key);
return depPath + idarg + starg + larg;
}
else {
string lengthiestDep = umcopy[key].deps[0];
for (auto i = begin(umcopy[key].deps); i != end(umcopy[key].deps); i++) {
if (umcopy[*i].jobatKey->getLength() >
umcopy[lengthiestDep].jobatKey->getLength()) {
lengthiestDep = *i;
}
}
recurseDeps(umcopy, lengthiestDep);
}
}
string criticalPath(unordered_map<string, Dependencies>& um, vector<Vertex*> aj) {
unordered_map<string, Dependencies> alCopy = um;
string path = aj[0]->getId();
for (auto i = begin(aj); i != end(aj); i++) {
if (um[(*i)->getId()].jobatKey->getLength() >
um[path].jobatKey->getLength()) {
path = (*i)->getId();
}
}
return recurseDeps(alCopy, path);
}
Later on down in the class members, a function called readStream() calls the functions like so:
cout << time << criticalPath(adjList, activeJobs) << endl;
You're not returning the value when you recurse. You're making the recursive call, but discarding the value and just falling off the end of the function. You need to do:
return recurseDeps(umcopy, lengthiestDep);
First of all, to answer your question, since you return by value the string is copied so no need to worry about variables going out of scope.
Secondly, and a much bigger problem, is that not all paths of your recursive function actually return a value, which leads to undefined behavior. If your compiler doesn't already warn you about this, you should enable more warnings.

Need suggestions to improve speed for word break (dynamic programming)

The problem is: Given a string s and a dictionary of words dict, determine if s can be segmented into a space-separated sequence of one or more dictionary words.
For example, given
s = "hithere",
dict = ["hi", "there"].
Return true because "hithere" can be segmented as "hi there".
My implementation is as below. This code is ok for normal cases. However, it suffers a lot for input like:
s = "aaaaaaaaaaaaaaaaaaaaaaab", dict = {"aa", "aaaaaa", "aaaaaaaa"}.
I want to memoize the processed substrings; however, I cannot get it right. Any suggestions on how to improve? Thanks a lot!
class Solution {
public:
bool wordBreak(string s, unordered_set<string>& wordDict) {
int len = s.size();
if(len<1) return true;
for(int i(0); i<len; i++) {
string tmp = s.substr(0, i+1);
if((wordDict.find(tmp)!=wordDict.end())
&& (wordBreak(s.substr(i+1), wordDict)) )
return true;
}
return false;
}
};
It's logically a two-step process. Find all dictionary words within the input, consider the found positions (begin/end pairs), and then see if those words cover the whole input.
So you'd get for your example
aa: {0,2}, {1,3}, {2,4}, ... {20,22}
aaaaaa: {0,6}, {1,7}, ... {16,22}
aaaaaaaa: {0,8}, {1,9} ... {14,22}
This is a graph, with nodes 0-23 and a bunch of edges. But node 23 (the end of the string, just past the final b) is entirely unreachable - no incoming edge. This is now a simple graph theory problem.
Finding all places where dictionary words occur is pretty easy if your dictionary is organized as a trie. But even a std::map is usable, thanks to its equal_range method. You have what appears to be an O(N*N) nested loop for begin and end positions, with O(log N) lookup of each word. But you can quickly determine if s.substr(begin,end) is still a viable prefix, and which dictionary words remain with that prefix.
Also note that you can build the graph lazily. Starting at begin=0 you find edges {0,2}, {0,6} and {0,8} (and no others). You can now search nodes 2, 6 and 8. You even have a good algorithm - A* - that suggests you try node 8 first (reachable in just 1 edge). Thus, you'll find nodes {8,10}, {8,14} and {8,16} etc. As you see, you'll never need to build the part of the graph that contains {1,3}, as it's simply unreachable.
Using graph theory, it's easy to see why your brute-force method breaks down. You arrive at node 8 (aaaaaaaa.aaaaaaaaaaaaaab) repeatedly, and each time search the subgraph from there on.
A further optimization is to run bidirectional A*. This would give you a very fast solution. In the second half of the first step, you look for edges leading to node 23 (the b). As none exist, you immediately know that node 23 is isolated.
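For illustration, here is a minimal sketch of that graph view using a plain BFS over positions instead of A*: nodes are positions in s, a dictionary word found at position i contributes an edge i -> i + length, and s is segmentable iff the end node is reachable from node 0.
#include <queue>
#include <string>
#include <unordered_set>
#include <vector>

bool wordBreakBFS(const std::string& s, const std::unordered_set<std::string>& dict) {
    const std::size_t n = s.size();
    std::vector<bool> visited(n + 1, false); // nodes already reached
    std::queue<std::size_t> todo;
    todo.push(0);
    visited[0] = true;
    while (!todo.empty()) {
        std::size_t i = todo.front();
        todo.pop();
        if (i == n) return true; // reached the end node
        for (std::size_t j = i + 1; j <= n; ++j) {
            if (!visited[j] && dict.count(s.substr(i, j - i)) != 0) {
                visited[j] = true; // edge i -> j exists, node j now reachable
                todo.push(j);
            }
        }
    }
    return false; // end node unreachable (like the isolated 'b' node above)
}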
In your code, you are not using dynamic programming because you are not remembering the subproblems that you have already solved.
You can enable this remembering, for example, by storing the results based on the starting position of the string s within the original string, or even based on its length (because anyway the strings you are working with are suffixes of the original string, and therefore its length uniquely identifies it). Then, in the beginning of your wordBreak function, just check whether such length has already been processed and, if it has, do not rerun the computations, just return the stored value. Otherwise, run computations and store the result.
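For illustration, a minimal sketch of your recursive function with that memoization added, caching results by starting position so each suffix is solved at most once:
#include <string>
#include <unordered_set>
#include <vector>

class Solution {
public:
    bool wordBreak(const std::string& s, const std::unordered_set<std::string>& wordDict) {
        std::vector<int> memo(s.size() + 1, -1); // -1 unknown, 0 false, 1 true
        return helper(s, 0, wordDict, memo);
    }
private:
    static bool helper(const std::string& s, std::size_t start,
                       const std::unordered_set<std::string>& dict,
                       std::vector<int>& memo) {
        if (start == s.size()) return true;
        if (memo[start] != -1) return memo[start] == 1; // already solved
        for (std::size_t end = start + 1; end <= s.size(); ++end) {
            if (dict.count(s.substr(start, end - start)) != 0 &&
                helper(s, end, dict, memo)) {
                memo[start] = 1;
                return true;
            }
        }
        memo[start] = 0;
        return false;
    }
};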
Note also that your approach with unordered_set will not allow you to obtain the fastest solution. The fastest solution that I can think of is O(N^2) by storing all the words in a trie (not in a map!) and following this trie as you walk along the given string. This achieves O(1) per loop iteration not counting the recursion call.
Thanks for all the comments. I changed my previous solution to the implementation below. At this point, I didn't explore optimizing the dictionary, but those insights are very valuable and much appreciated.
For the current implementation, do you think it can be further improved? Thanks!
class Solution {
public:
bool wordBreak(string s, unordered_set<string>& wordDict) {
int len = s.size();
if(len<1) return true;
if(wordDict.size()==0) return false;
vector<bool> dq (len+1,false);
dq[0] = true;
for(int i(0); i<len; i++) {// start point
if(dq[i]) {
for(int j(1); j<=len-i; j++) {// length of substring, 1:len
if(!dq[i+j]) {
auto pos = wordDict.find(s.substr(i, j));
dq[i+j] = dq[i+j] || (pos!=wordDict.end());
}
}
}
if(dq[len]) return true;
}
return false;
}
};
Try the following:
class Solution {
public:
bool wordBreak(string s, unordered_set<string>& wordDict)
{
for (auto w : wordDict)
{
auto pos = s.find(w);
if (pos != string::npos)
{
if (wordBreak(s.substr(0, pos), wordDict) &&
wordBreak(s.substr(pos + w.size()), wordDict))
return true;
}
}
return false;
}
};
Essentially, once you find a match, remove the matching part from the input string and continue testing on a smaller input.

Alternatives to standard functions of C++ to get speed optimization

Just to clarify: I also think the title is a bit silly. We all know that most built-in functions of the language are really well written and fast (some are even written in assembly). Still, maybe there is some advice for my situation. I have a small project which demonstrates the work of a search engine. In the indexing phase, I have a filter method to filter out unnecessary things from the keywords. Here it is:
bool Indexer::filter(string &keyword)
{
// Remove all characters defined in isGarbage method
keyword.resize(std::remove_if(keyword.begin(), keyword.end(), isGarbage) - keyword.begin());
// Transform all characters to lower case
std::transform(keyword.begin(), keyword.end(), keyword.begin(), ::tolower);
// After filtering, if the keyword is empty or it is contained in stop words list, mark as invalid keyword
if (keyword.size() == 0 || stopwords_.find(keyword) != stopwords_.end())
return false;
return true;
}
At first sight, these functions (all member functions of STL containers or standard functions) are supposed to be fast and should not take much time in the indexing phase. But after profiling with Valgrind, the inclusive cost of this filter is ridiculously high: 33.4%. Three standard functions in this filter take most of that percentage: std::remove_if takes 6.53%, std::set::find takes 15.07% and std::transform takes 7.71%.
So if there is anything I can do (or change) to reduce the instruction time cost of this filter (like parallelizing or something like that), please give me your advice. Thanks in advance.
UPDATE: Thanks for all your suggestions. In brief, what I need to do is:
1) Merge tolower and remove_if into one pass by constructing my own loop.
2) Use unordered_set instead of set for a faster find method.
Thus I've chosen Mark_B's as the accepted answer.
First, are you certain that optimization and inlining are enabled when you compile?
Assuming that's the case, I would first try writing my own transformer that combines removing garbage and lower-casing into one step to prevent iterating over the keyword that second time.
There's not a lot you can do about the find without using a different container such as unordered_set as suggested in a comment.
Is it possible that, for your application, the filtering really is just an inherently CPU-intensive part of the operation?
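As a minimal sketch of that combined transformer (written as a hypothetical free function rather than the member Indexer::filter, with the stop words switched to std::unordered_set as suggested):
#include <cctype>
#include <string>
#include <unordered_set>

bool isGarbage(char c); // the question's predicate, defined elsewhere

bool filterKeyword(std::string& keyword, const std::unordered_set<std::string>& stopwords) {
    std::string result;
    result.reserve(keyword.size());
    // Single pass: drop garbage and lower-case in the same loop.
    for (unsigned char c : keyword)
        if (!isGarbage(static_cast<char>(c)))
            result.push_back(static_cast<char>(std::tolower(c)));
    keyword.swap(result);
    // One hash lookup instead of a balanced-tree search.
    return !keyword.empty() && stopwords.find(keyword) == stopwords.end();
}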
If you use a boost filter iterator you can merge the remove_if and transform into one, something like (untested):
keyword.erase(std::transform(boost::make_filter_iterator(!boost::bind(isGarbage, _1), keyword.begin(), keyword.end()),
boost::make_filter_iterator(!boost::bind(isGarbage, _1), keyword.end(), keyword.end()),
keyword.begin(),
::tolower), keyword.end());
This assumes you want the side effect of modifying the string to still be visible externally; otherwise pass by const reference instead and just use count_if and a predicate to do it all in one. You can also build a hierarchical data structure (basically a tree) for the list of stop words that makes "in-place" matching possible. For example, if your stop words are SELECT, SELECTION and SELECTED you might build a tree:
|- (other/empty, accept)
\- S-E-L-E-C-T- (empty, fail)
   |- (other, accept)
   |- I-O-N (fail)
   \- E-D (fail)
You can traverse a tree structure like that simultaneously whilst transforming and filtering without any modifications to the string itself. In reality you'd want to compact the multi-character runs into a single node in the tree (probably).
You can build such a data structure fairly trivially with something like:
#include <iostream>
#include <map>
#include <memory>
#include <string>
class keywords {
struct node {
node() : end(false) {}
std::map<char, std::unique_ptr<node>> children;
bool end;
} root;
void add(const std::string::const_iterator& stop, const std::string::const_iterator c, node& n) {
if (!n.children[*c])
n.children[*c] = std::unique_ptr<node>(new node);
if (stop == c+1) {
n.children[*c]->end = true;
return;
}
add(stop, c+1, *n.children[*c]);
}
public:
void add(const std::string& str) {
add(str.end(), str.begin(), root);
}
bool match(const std::string& str) const {
const node *current = &root;
std::string::size_type pos = 0;
while(current && pos < str.size()) {
const std::map<char,std::unique_ptr<node>>::const_iterator it = current->children.find(str[pos++]);
current = it != current->children.end() ? it->second.get() : nullptr;
}
if (!current) {
return false;
}
return current->end;
}
};
int main() {
keywords list;
list.add("SELECT");
list.add("SELECTION");
list.add("SELECTED");
std::cout << list.match("TEST") << std::endl;
std::cout << list.match("SELECT") << std::endl;
std::cout << list.match("SELECTOR") << std::endl;
std::cout << list.match("SELECTED") << std::endl;
std::cout << list.match("SELECTION") << std::endl;
}
This worked as you'd hope and gave:
0
1
0
1
1
Which then just needs to have match() modified to call the transformation and filtering functions appropriately e.g.:
const char c = str[pos++];
if (filter(c)) {
const std::map<char,std::unique_ptr<node>>::const_iterator it = current->children.find(transform(c));
}
You can optimise this a bit (compact long single string runs) and make it more generic, but it shows how doing everything in-place in one pass might be achieved and that's the most likely candidate for speeding up the function you showed.
(Benchmark changes of course)
If a call to isGarbage() does not require synchronization, then parallelization should be the first optimization to consider (given of course that filtering one keyword is a big enough task, otherwise parallelization should be done one level higher). Here's how it could be done - in one pass through the original data, multi-threaded using Threading Building Blocks:
#include <cctype>
#include <iostream>
#include <string>
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

bool isGarbage(char c) {
return c == 'a';
}
struct RemoveGarbageAndLowerCase {
std::string result;
const std::string& keyword;
RemoveGarbageAndLowerCase(const std::string& keyword_) : keyword(keyword_) {}
RemoveGarbageAndLowerCase(RemoveGarbageAndLowerCase& r, tbb::split) : keyword(r.keyword) {}
void operator()(const tbb::blocked_range<size_t> &r) {
for(size_t i = r.begin(); i != r.end(); ++i) {
if(!isGarbage(keyword[i])) {
result.push_back(tolower(keyword[i]));
}
}
}
void join(RemoveGarbageAndLowerCase &rhs) {
result.insert(result.end(), rhs.result.begin(), rhs.result.end());
}
};
void filter_garbage(std::string &keyword) {
RemoveGarbageAndLowerCase res(keyword);
tbb::parallel_reduce(tbb::blocked_range<size_t>(0, keyword.size()), res);
keyword = res.result;
}
int main() {
std::string keyword = "ThIas_iS:saome-aTYpe_Ofa=MoDElaKEYwoRDastrang";
filter_garbage(keyword);
std::cout << keyword << std::endl;
return 0;
}
Of course, the final code could be improved further by avoiding data copying, but the goal of the sample is to demonstrate that it's an easily threadable problem.
You might make this faster by making a single pass through the string, ignoring the garbage characters. Something like this (pseudo-code):
std::string normalizedKeyword;
normalizedKeyword.reserve(keyword.size());
for (auto p = keyword.begin(); p != keyword.end(); ++p)
{
char ch = *p;
if (!isGarbage(ch))
normalizedKeyword.push_back(tolower(ch));
}
// then search for normalizedKeyword in stopwords
This should eliminate the overhead of std::remove_if, although there is a memory allocation and some new overhead of copying characters to normalizedKeyword.
The problem here isn't the standard functions, it's your use of them. You are making multiple passes over your string when you obviously need to be doing only one.
What you need to do probably can't be done with the algorithms straight up, you'll need help from boost or rolling your own.
You should also carefully consider whether resizing the string is actually necessary. Yeah, you might save some space but it's going to cost you in speed. Removing this alone might account for quite a bit of your operation's expense.
Here's a way to combine the garbage removal and lower-casing into a single step. It won't work for multi-byte encoding such as UTF-8, but neither did your original code. I assume 0 and 1 are both garbage values.
bool Indexer::filter(string &keyword)
{
static char replacements[256] = {1}; // initialize with an invalid char
if (replacements[0] == 1)
{
for (int i = 0; i < 256; ++i)
replacements[i] = isGarbage(i) ? 0 : ::tolower(i);
}
string::iterator tail = keyword.begin();
for (string::iterator it = keyword.begin(); it != keyword.end(); ++it)
{
unsigned int index = (unsigned int) *it & 0xff;
if (replacements[index])
*tail++ = replacements[index];
}
keyword.resize(tail - keyword.begin());
    // After filtering, if the keyword is empty or it is contained in stop words list, mark as invalid keyword
    if (keyword.size() == 0 || stopwords_.find(keyword) != stopwords_.end())
        return false;
    return true;
}
The largest part of your timing is the std::set::find so I'd also try std::unordered_set to see if it improves things.
I would implement it with lower level C functions, something like this maybe (not checking this compiles), doing the replacement in place and not resizing the keyword.
Instead of using a set for garbage characters, I'd add a static table of all 256 characters (yeah, it will work for ASCII only), with 0 for all characters that are OK, and 1 for those that should be filtered out. Something like:
static const char GARBAGE[256] = { 1, 1, 1, 1, 1, ...., 0, 0, 0, 0, 1, 1, ... };
Then, for each character at offset pos in a const char *str, you can just check if (GARBAGE[str[pos]] == 1);
This is more or less what an unordered set does, but will have far fewer instructions. stopwords should be an unordered_set if it isn't one already.
Now the filtering function (I'm assuming ASCII/UTF-8 and null-terminated strings here):
bool Indexer::filter(char *keyword)
{
char *head = keyword;
char *tail = keyword;
while (*head != '\0') {
//copy non garbage chars from head to tail, lowercasing them while at it
if (!GARBAGE[(unsigned char)*head]) {
*tail = tolower(*head);
++tail; //we only advance tail if no garbage
}
//head always advances
++head;
}
*tail = '\0';
// After filtering, if the keyword is empty or it is contained in stop words list, mark as invalid keyword
if (tail == keyword || stopwords_.find(keyword) != stopwords_.end())
return false;
return true;
}

Invalid heap error when trying to copy elements from a Map to a compatible priority Queue

My program makes a frequency map of characters (which I store in, surprise surprise, a Map). I am trying to copy each element from this map into a priority queue so that I can have a sorted copy of these values (I plan to make further use of the Q; that's why I am not sorting the map), but whenever I try to copy these values, the program executes fine for the first two or three iterations and fails on the fourth, citing an "Invalid heap" error.
I'm not sure how to proceed from here, so I am posting the code for the classes in question.
#include "srcFile.h"
#include <string>
#include <iostream>
srcFile::srcFile(std::string s_flName)
{
// Storing the file name
s_fileName= s_flName;
}
srcFile::srcFile()
{
// Default constructor (never to be used)
}
srcFile::~srcFile(void)
{
}
void srcFile::dispOverallMap ()
{
std::map<char,int>::iterator dispIterator;
dispIterator = map_charFreqDistribution.begin();
charElement *currentChar;
std::cout<<"\n Frequency distribution map \n";
while(dispIterator != map_charFreqDistribution.end())
{
std::cout<< "Character : " << (int)dispIterator->first << " Frequency : "<< dispIterator->second<<'\n';
currentChar = new charElement(dispIterator->first,dispIterator->second);
Q_freqDistribution.push(*currentChar);
dispIterator++;
// delete currentChar;
}
while(!Q_freqDistribution.empty())
{
std::cout<<'\n'<<"Queue Element : " << (int)Q_freqDistribution.top().ch_elementChar << " Frequency : " << Q_freqDistribution.top().i_frequency;
Q_freqDistribution.pop();
}
}
map_charFreqDistribution has already been populated; if I remove the line
Q_freqDistribution.push(*currentChar);
Then I can verify that the Map is indeed there.
Also, both the Q and its underlying vector use charElement as the template type; it's nothing except the character and its frequency, along with 2 pointers to facilitate tree generation (unused up to this point).
Adding the definition of charElement on request
#pragma once
class charElement
{
public:
// Holds the character for the element in question
char ch_elementChar;
// Holds the number of times the character appeared in the file
int i_frequency;
// Left pointer for tree
charElement* ptr_left;
// Right pointer for tree
charElement* ptr_right;
charElement(char,int);
charElement(void);
~charElement(void);
void operator=(charElement&);
};
class compareCharElt
{
public:
bool operator()(charElement &operand1,charElement &operand2)
{
// If the frequency of op1 < op2 then return true
if(operand1.i_frequency < operand2.i_frequency) return true;
// If the frequency of op1 > op2 then return false
if(operand1.i_frequency > operand2.i_frequency)return false;
// If the frequency of op1 == op2 then return false (neither has strictly lower priority when frequencies are equal)
if(operand1.i_frequency == operand2.i_frequency)return false;
}
};
Definition of Map and Queue
// The map which holds the frequency distribution of all characters in the file
std::map<char,int> map_charFreqDistribution;
void dispOverallMap();
// Create Q which holds character elements
std::priority_queue<charElement,std::vector<charElement>,compareCharElt> Q_freqDistribution;
P.S. This may be a noob question, but is there an easier way to post blocks of code? Putting 4 spaces in front of huge code chunks doesn't seem all that efficient to me! Are pastebin links acceptable here?
Your vector is reallocating and invalidating your pointers. You need to use a different data structure, or an index into the vector, instead of a raw pointer. When you insert elements into a vector, pointers to its contents become invalid.
while(dispIterator != map_charFreqDistribution.end())
{
std::cout<< "Character : " << (int)dispIterator->first << " Frequency : "<< dispIterator->second<<'\n';
currentChar = new charElement(dispIterator->first,dispIterator->second);
Q_freqDistribution.push(*currentChar);
dispIterator++;
delete currentChar;
}
This completely throws people off, because it's very traditional for people to have huge problems when using new and delete directly; but there's actually no need for them whatsoever in this code, and everything is actually done by value.
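For illustration, a by-value version of that loop, with no new or delete at all:
while (dispIterator != map_charFreqDistribution.end())
{
    // push() copies the element into the queue; no manual heap management needed
    Q_freqDistribution.push(charElement(dispIterator->first, dispIterator->second));
    ++dispIterator;
}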
You have two choices. Pick a structure (e.g. std::list) that does not invalidate pointers, or allocate all charElements on the heap directly and use something like shared_ptr that cleans up for you.
currentChar = new charElement(dispIterator->first,dispIterator->second);
Q_freqDistribution.push(*currentChar);
dispIterator++;
delete currentChar;
In the above code, you create a new charElement object, then push it, and then delete it. When you call delete, that object no longer exists -- not even in the queue. That's probably not what you want.