C++ - Get the "difference" of 2 strings like git

I'm currently working on a project which includes a Win32 console program on my Windows 10 PC and an app for my Windows 10 Mobile Phone. It's about controlling the master and audio session volumes on my PC over the app on my Windows Phone.
The "little" problem I have right now is to get the "difference" between 2 strings.
Let's take these 2 strings for example:
std::string oldVolumes = "MASTER:50:SYSTEM:50:STEAM:100:UPLAY:100";
std::string newVolumes = "MASTER:30:SYSTEM:50:STEAM:100:ROCKETLEAGUE:80:CHROME:100";
Now I want to compare these 2 strings. Let's say I explode each string into a vector with ":" as the delimiter (I have a function named explode that cuts the given string at the delimiter and writes the pieces into a vector).
Good enough. But as you can see, in the old string there's UPLAY with the value 100, but it's missing in the new string. Also, there are 2 new values (RocketLeague and Chrome), which are missing in the old one. But not only the "audio sessions/names" are different, the values are different too.
What I want now is, for each session that is in both strings (like master and system), to compare the values, and if the new value is different from the old one, to append this change to another string, like:
std::string volumeChanges = "MASTER:30"; // Cause Master is changed, System not
If there's a session in the old string, but not in the new one, I want to append:
std::string volumeChanges = "MASTER:30:REMOVE:UPLAY";
If there's a session in the new one, which is missing in the old string, I want to append it like that:
std::string volumeChanges = "MASTER:30:REMOVE:UPLAY:ADD:ROCKETLEAGUE:ROCKETLEAGUE:80:ADD:CHROME:CHROME:100";
The volumeChanges string is just to show you, what I need. I'll try to make a better one afterwards.
Do you have any ideas of how to implement such a comparison? I don't need a specific code example or something, just some ideas of how I could do that in theory. It's like GIT at least. If you make changes in a text file, you see in red the deleted text and in green the added one. Something similar to this, just with strings or vectors of strings.

Let's say I explode each string to a vector with the ":" as delimiter (I have a function named explode to cut the given string by the delimiter and write the string before into a vector).
I'd advise you to extend that logic further and separate the tokens into property objects that each hold a name and a value:
struct property {
std::string name;
int32_t value; // requires #include <cstdint>
bool same_name(property const& o) const {
return name == o.name;
}
bool same_value(property const& o) const {
return value == o.value;
}
bool operator==(property const& o) const {
return same_name(o) && same_value(o);
}
bool operator<(property const& o) const {
if(!same_name(o)) return name < o.name;
else return value < o.value;
}
};
This will dramatically simplify the logic needed to work out which properties were changed/added/removed.
The logic for "tokenizing" this kind of string isn't too difficult:
std::set<property> tokenify(std::string input) {
bool finding_name = true;
property prop;
std::set<property> properties;
while (input.size() > 0) {
auto colon_index = input.find(':');
if (finding_name) {
prop.name = input.substr(0, colon_index);
finding_name = false;
}
else {
prop.value = std::stoi(input.substr(0, colon_index));
finding_name = true;
properties.insert(prop);
}
if(colon_index == std::string::npos)
break;
else
input = input.substr(colon_index + 1);
}
return properties;
}
Then, the function to get the difference:
std::string get_diff_string(std::string const& old_props, std::string const& new_props) {
std::set<property> old_properties = tokenify(old_props);
std::set<property> new_properties = tokenify(new_props);
std::string output;
//We first scan for properties that were either removed or changed
for (property const& old_property : old_properties) {
auto predicate = [&](property const& p) {
return old_property.same_name(p);
};
auto it = std::find_if(new_properties.begin(), new_properties.end(), predicate);
if (it == new_properties.end()) {
//We didn't find the property, so we need to indicate it was removed
output.append("REMOVE:" + old_property.name + ':');
}
else if (!it->same_value(old_property)) {
//Found the property, but the value changed.
output.append(it->name + ':' + std::to_string(it->value) + ':');
}
}
//Finally, we need to see which were added.
for (property const& new_property : new_properties) {
auto predicate = [&](property const& p) {
return new_property.same_name(p);
};
auto it = std::find_if(old_properties.begin(), old_properties.end(), predicate);
if (it == old_properties.end()) {
//We didn't find the property, so we need to indicate it was added
output.append("ADD:" + new_property.name + ':' + new_property.name + ':' + std::to_string(new_property.value) + ':');
}
//The previous loop detects changes, so we don't need to bother here.
}
if (output.size() > 0)
output = output.substr(0, output.size() - 1); //Trim off the last colon
return output;
}
And we can demonstrate that it's working with a simple main function:
int main() {
std::string diff_string = get_diff_string("MASTER:50:SYSTEM:50:STEAM:100:UPLAY:100", "MASTER:30:SYSTEM:50:STEAM:100:ROCKETLEAGUE:80:CHROME:100");
std::cout << "Diff String was \"" << diff_string << '\"' << std::endl;
}
Which yields an output (according to IDEONE.com):
Diff String was "MASTER:30:REMOVE:UPLAY:ADD:CHROME:CHROME:100:ADD:ROCKETLEAGUE:ROCKETLEAGUE:80"
Which, although the contents are in a slightly different order than your example, still contains all the correct information. The contents are in different order because std::set implicitly sorted the attributes by name when tokenizing the properties; if you want to disable that sorting, you'd need to use a different data structure which preserves entry order. I chose it because it eliminates duplicates, which could cause odd behavior otherwise.

In this particular instance, you could do it as follows:
Split the old and new strings by the delimiter, and store the results in a vector.
Loop over the vector with the old data. Look for each word in the vector with new data: e.g. find("MASTER").
If not found add "REMOVE:MASTER" to your results.
If found, compare the numbers and add it to the results if it has been changed.
The added string can be found by looping over the new string and searching for the words in the old string.
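The steps above could be sketched roughly as follows. This is only an illustration, not the asker's explode function: the names parse_pairs, find_by_name and diff_volumes are hypothetical, values are compared as text, and the input is assumed to be a well-formed NAME:VALUE:NAME:VALUE list. Because plain vectors are used, the output keeps the order of the input strings.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper type: (session name, volume as text).
using Entry = std::pair<std::string, std::string>;

// Split "NAME:VALUE:NAME:VALUE..." into pairs, keeping the input order.
std::vector<Entry> parse_pairs(const std::string& s) {
    std::vector<std::string> tokens;
    std::size_t start = 0;
    while (start <= s.size()) {
        std::size_t colon = s.find(':', start);
        if (colon == std::string::npos) colon = s.size();
        tokens.push_back(s.substr(start, colon - start));
        start = colon + 1;
    }
    std::vector<Entry> out;
    for (std::size_t i = 0; i + 1 < tokens.size(); i += 2)
        out.emplace_back(tokens[i], tokens[i + 1]);
    return out;
}

// Linear search by name; returns nullptr when the name is absent.
const Entry* find_by_name(const std::vector<Entry>& v, const std::string& name) {
    for (const Entry& e : v)
        if (e.first == name) return &e;
    return nullptr;
}

std::string diff_volumes(const std::string& old_s, const std::string& new_s) {
    const std::vector<Entry> old_v = parse_pairs(old_s);
    const std::vector<Entry> new_v = parse_pairs(new_s);
    std::vector<std::string> parts;
    // Pass 1: sessions from the old string that were removed or changed.
    for (const Entry& o : old_v) {
        const Entry* n = find_by_name(new_v, o.first);
        if (n == nullptr) {
            parts.push_back("REMOVE");
            parts.push_back(o.first);
        } else if (n->second != o.second) {
            parts.push_back(n->first);
            parts.push_back(n->second);
        }
    }
    // Pass 2: sessions that only exist in the new string.
    for (const Entry& n : new_v) {
        if (find_by_name(old_v, n.first) == nullptr) {
            parts.push_back("ADD");
            parts.push_back(n.first);
            parts.push_back(n.first);
            parts.push_back(n.second);
        }
    }
    // Join with ':' without leaving a trailing delimiter.
    std::string out;
    for (std::size_t i = 0; i < parts.size(); ++i) {
        if (i != 0) out += ':';
        out += parts[i];
    }
    return out;
}
```

Each lookup is a linear scan, which should be fine for a handful of audio sessions; for long lists a sorted container or a hash map would scale better.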

I suggest that you enumerate some features (in your case, for example: UPLAY is present, REMOVE is present, ...).
For every one of those, assign a weight that counts if the two strings differ in the given feature.
At the end, sum up the weights of the features present in one string and absent in the other to get a number.
This number should represent what you are looking for.
You can adjust weights until you are satisfied with the result.

Maybe my answer will give you some new thoughts. In fact, by tweaking the current code, you can find all the missing words.
std::vector<std::string> splitString(const std::string& str, const char delim)
{
std::vector<std::string> out;
std::stringstream ss(str);
std::string s;
while (std::getline(ss, s, delim)) {
out.push_back(s);
}
return out;
}
std::vector<std::string> missingWords(const std::string& first, const std::string& second)
{
std::vector<std::string> missing;
const auto firstWords = splitString(first, ' ');
const auto secWords = splitString(second, ' ');
size_t i = 0, j = 0;
for(; i < firstWords.size();){
auto findSameWord = std::find(secWords.begin() + j, secWords.end(), firstWords[i]);
if(findSameWord == secWords.end()) {
missing.push_back(firstWords[i]);
j++;
} else {
j = distance(secWords.begin(), findSameWord);
}
i++;
}
return missing;
}

Related

Generate string lexicographically larger than input

Given an input string A, is there a concise way to generate a string B that is lexicographically larger than A, i.e. A < B == true?
My raw solution would be to say:
B = A;
++B.back();
but in general this won't work because:
A might be empty
The last character of A may be close to wraparound, in which case the resulting character will have a smaller value i.e. B < A.
Adding an extra character every time is wasteful and will quickly result in unreasonably large strings.
So I was wondering whether there's a standard library function that can help me here, or if there's a strategy that scales nicely when I want to start from an arbitrary string.
You can duplicate A into B then look at the final character. If the final character isn't the final character in your range, then you can simply increment it by one.
Otherwise you can look at last-1, last-2, last-3. If you get to the front of the list of chars, then append to the length.
Here is my dummy solution:
std::string make_greater_string(std::string const &input)
{
std::string ret{std::numeric_limits<
std::string::value_type>::min()};
if (!input.empty())
{
if (std::numeric_limits<std::string::value_type>::max()
== input.back())
{
ret = input + ret;
}
else
{
ret = input;
++ret.back();
}
}
return ret;
}
Ideally I'd hope to avoid the explicit handling of all special cases, and use some facility that can more naturally handle them. Already looking at the answer by @JosephLarson I see that I could increment more than the last character, which would improve the range achievable without adding more characters.
And here's the refinement after the suggestions in this post:
std::string make_greater_string(std::string const &input)
{
constexpr char minC = ' ', maxC = '~';
// Working with limits was a pain,
// using ASCII typical limit values instead.
std::string ret{minC};
auto rit = input.rbegin();
while (rit != input.rend())
{
if (maxC == *rit)
{
++rit;
if (rit == input.rend())
{
ret = input + ret;
break;
}
}
else
{
ret = input;
++(*(ret.rbegin() + std::distance(input.rbegin(), rit)));
break;
}
}
return ret;
}
You can copy the string and append some letters - this will produce a lexicographically larger result.
B = A + "a"

I need to create MultiMap using hash-table but I get time-limit exceeded error (C++)

I'm trying to solve an algorithm task: I need to create a MultiMap (key, (values)) using a hash table. I can't use the Set and Map libraries. I submit the code to a testing system, but I get a time-limit-exceeded error on test 20. I don't know what exactly this test contains. The code must perform the following tasks:
put x y - add pair (x,y).If pair exists, do nothing.
delete x y - delete pair(x,y). If pair doesn't exist, do nothing.
deleteall x - delete all pairs with first element x.
get x - print number of pairs with first element x and second elements.
The amount of operations <= 100000
Time limit - 2s
Example:
multimap.in:
put a a
put a b
put a c
get a
delete a b
get a
deleteall a
get a
multimap.out:
3 b c a
2 c a
0
#include <iostream>
#include <fstream>
#include <vector>
using namespace std;
inline long long h1(const string& key) {
long long number = 0;
const int p = 31;
int pow = 1;
for(auto& x : key){
number += (x - 'a' + 1 ) * pow;
pow *= p;
}
return abs(number) % 1000003;
}
inline void Put(vector<vector<pair<string,string>>>& Hash_table,const long long& hash, const string& key, const string& value) {
int checker = 0;
for(int i = 0; i < Hash_table[hash].size();i++) {
if(Hash_table[hash][i].first == key && Hash_table[hash][i].second == value) {
checker = 1;
break;
}
}
if(checker == 0){
pair <string,string> key_value = make_pair(key,value);
Hash_table[hash].push_back(key_value);
}
}
inline void Delete(vector<vector<pair<string,string>>>& Hash_table,const long long& hash, const string& key, const string& value) {
for(int i = 0; i < Hash_table[hash].size();i++) {
if(Hash_table[hash][i].first == key && Hash_table[hash][i].second == value) {
Hash_table[hash].erase(Hash_table[hash].begin() + i);
break;
}
}
}
inline void Delete_All(vector<vector<pair<string,string>>>& Hash_table,const long long& hash,const string& key) {
for(int i = Hash_table[hash].size() - 1;i >= 0;i--){
if(Hash_table[hash][i].first == key){
Hash_table[hash].erase(Hash_table[hash].begin() + i);
}
}
}
inline string Get(const vector<vector<pair<string,string>>>& Hash_table,const long long& hash, const string& key) {
string result="";
int counter = 0;
for(int i = 0; i < Hash_table[hash].size();i++){
if(Hash_table[hash][i].first == key){
counter++;
result += Hash_table[hash][i].second + " ";
}
}
if(counter != 0)
return to_string(counter) + " " + result + "\n";
else
return "0\n";
}
int main() {
vector<vector<pair<string,string>>> Hash_table;
Hash_table.resize(1000003);
ifstream input("multimap.in");
ofstream output("multimap.out");
string command;
string key;
int k = 0;
string value;
while(true) {
input >> command;
if(input.eof())
break;
if(command == "put") {
input >> key;
long long hash = h1(key);
input >> value;
Put(Hash_table,hash,key,value);
}
if(command == "delete") {
input >> key;
input >> value;
long long hash = h1(key);
Delete(Hash_table,hash,key,value);
}
if(command == "get") {
input >> key;
long long hash = h1(key);
output << Get(Hash_table,hash,key);
}
if(command == "deleteall"){
input >> key;
long long hash = h1(key);
Delete_All(Hash_table,hash,key);
}
}
}
How can I make my code run faster?
First of all, a matter of design: normally, one would pass only the key to the function and calculate the hash within. Your variant allows a user to place elements anywhere within the hash table (using bad hash values), so the user could easily break it.
So e. g. put:
using HashTable = std::vector<std::vector<std::pair<std::string, std::string>>>;
void put(HashTable& table, std::string const& key, std::string const& value)
{
auto hash = h1(key);
// ...
}
If at all, the hash function could be parametrised, but then you'd write a separate class (wrapping the vector of vectors) and provide the hash function in the constructor so that a user cannot exchange it arbitrarily (and again break the hash table). A class would come with additional benefits, most importantly better encapsulation (hiding the vector away, so the user cannot change it through the vector's own interface):
class HashTable
{
public:
// IF you want to provide hash function:
template <typename Hash>
HashTable(Hash hash) : hash(hash) { }
void put(std::string const& key, std::string const& value);
void remove(std::string const& key, std::string const& value); //(delete is keyword!)
// ...
private:
std::vector<std::vector<std::pair<std::string, std::string>>> data;
// if hash function parametrized:
std::function<size_t(std::string)> hash; // needs #include <functional>
};
I'm not 100% sure how efficient std::function really is, so for high-performance code you should preferably use your hash function h1 directly (not implementing the constructor as illustrated above).
Coming to optimisations:
For the hash key I would prefer unsigned value: Negative indices are meaningless anyway, so why allow them at all? long long (signed or unsigned) might be a bad choice if testing system is a 32 bit system (might be unlikely, but still...). size_t covers both issues at once: it is unsigned and it is selected in size appropriately for given system (if interested in details: actually adjusted to address bus size, but on modern systems, this is equal to register size as well, which is what we need). Select type of pow to be the same.
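A minimal sketch of a size_t-based variant of h1 might look like this. The table size is taken from the question; using the raw byte value of each character (instead of x - 'a' + 1) and reducing modulo the table size after every step are assumptions made here so that intermediate values stay small and never go negative.

```cpp
#include <cstddef>
#include <string>

// Table size from the question's code; treat it as a tunable assumption.
constexpr std::size_t kTableSize = 1000003;

// Polynomial hash with base 31, reduced modulo kTableSize at every step
// so that neither the running sum nor the power can grow without bound.
std::size_t h1(const std::string& key) {
    const std::size_t p = 31;
    std::size_t number = 0;
    std::size_t pow = 1;
    for (unsigned char c : key) {  // unsigned char: no negative character codes
        number = (number + c * pow) % kTableSize;
        pow = (pow * p) % kTableSize;
    }
    return number;  // always in [0, kTableSize)
}
```

With pow kept below kTableSize and c at most 255, every product stays well within even a 32-bit size_t, so the result does not depend on overflow behaviour.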
deleteAll is implemented inefficiently: With each element you erase you move all the subsequent elements one position towards front. If you delete multiple elements, you do this repeatedly, so one single element can get moved multiple times. Better:
auto pos = vector.begin();
for(auto& pair : vector)
{
if(pair.first != keyToDelete)
*pos++ = std::move(pair); // move semantics: faster than copying!
}
vector.erase(pos, vector.end());
This will move each element at most once, erasing all surplus elements in one single go. Apart from the final erasing (which you have to do explicitly then), this is more or less what std::remove and std::remove_if from the algorithm library do as well. Are you allowed to use it? Then your code might look like this:
auto condition = [&keyToDelete](std::pair<std::string, std::string> const& p)
{ return p.first == keyToDelete; };
vector.erase(std::remove_if(vector.begin(), vector.end(), condition), vector.end());
and you profit from already highly optimised algorithm.
Just a minor performance gain, but still: You can spare variable initialisation, assignment and conditional branch (the latter one can be relatively expensive operation on some systems) within put if you simply return if an element is found:
//int checker = 0;
for(auto& pair : hashTable[hash]) // just a little more comfortable to write...
{
if(pair.first == key && pair.second == value)
return;
}
auto key_value = std::make_pair(key, value);
hashTable[hash].push_back(key_value);
Again, with algorithm library:
auto key_value = std::make_pair(key, value);
// same condition as above!
if(std::find_if(vector.begin(), vector.end(), condition) == vector.end())
{
vector.push_back(key_value);
}
Then, fewer than 100000 operations does not mean that each operation will involve a separate key/value pair. We might expect that keys are added, removed, re-added, ..., so you most likely don't have to cope with 100000 different values. I'd assume your map is much too large (be aware that it requires initialisation of 1000003 vectors as well). I'd assume a much smaller one should suffice already (possibly 1009 or 10007? You might have to experiment a little...).
Keeping the inner vectors sorted might give you some performance boost as well:
put: You could use a binary search to find the two elements between which a new one is to be inserted (if one of these two is equal to the given one, no insertion, of course)
delete: Use binary search to find the element to delete.
deleteAll: Find upper and lower bounds for elements to be deleted and erase whole range at once.
get: find lower and upper bound as for deleteAll, distance in between (number of elements) is a simple subtraction and you could print out the texts directly (instead of first building a long string). Which of outputting directly or creating a string really is more efficient is to be found out, though, as outputting directly involves multiple system calls, which in the end might cost previously gained performance again...
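The sorted-bucket idea could be sketched like this, assuming the algorithm header is allowed by the task. The names Bucket, KeyLess and the bucket_* functions are hypothetical; each bucket is kept sorted by (key, value), so std::lower_bound finds a single pair and std::equal_range with a key-only comparator finds the whole run of pairs sharing a key.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Pair = std::pair<std::string, std::string>;
using Bucket = std::vector<Pair>;  // invariant: sorted by (key, value)

// Insert (key, value) unless it is already present; true when inserted.
bool bucket_put(Bucket& b, const std::string& key, const std::string& value) {
    const Pair p(key, value);
    auto it = std::lower_bound(b.begin(), b.end(), p);
    if (it != b.end() && *it == p) return false;
    b.insert(it, p);  // inserting at the lower bound keeps the bucket sorted
    return true;
}

// Remove exactly one (key, value) pair; true when something was removed.
bool bucket_remove(Bucket& b, const std::string& key, const std::string& value) {
    const Pair p(key, value);
    auto it = std::lower_bound(b.begin(), b.end(), p);
    if (it == b.end() || *it != p) return false;
    b.erase(it);
    return true;
}

// Comparator that looks at the key only, usable in both argument orders.
struct KeyLess {
    bool operator()(const Pair& a, const std::string& k) const { return a.first < k; }
    bool operator()(const std::string& k, const Pair& a) const { return k < a.first; }
};

// Number of pairs stored under key (the count that "get" prints first).
std::size_t bucket_count(const Bucket& b, const std::string& key) {
    auto range = std::equal_range(b.begin(), b.end(), key, KeyLess{});
    return static_cast<std::size_t>(range.second - range.first);
}

// Erase every pair stored under key in a single range erase.
std::size_t bucket_remove_all(Bucket& b, const std::string& key) {
    auto range = std::equal_range(b.begin(), b.end(), key, KeyLess{});
    const std::size_t n = static_cast<std::size_t>(range.second - range.first);
    b.erase(range.first, range.second);
    return n;
}
```

Insertion into the middle of a vector still shifts elements, so whether this beats the unsorted version depends on bucket sizes; it is worth measuring rather than assuming.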
Considering your input loop:
Checking for eof() (only) is critical! If there is an error in the file, you'll end up in an endless loop, as the fail bit gets set, operator>> actually won't read anything at all any more and you won't ever reach the end of the file. This even might be the reason for your 20th test failing.
Additionally: You have line-based input (each command on a separate line), so reading a whole line at once and only afterwards parsing it will spare you some system calls. If some argument is missing, you will detect it correctly instead of (illegally) reading the next command (e. g. put) as an argument; similarly, you won't interpret a surplus argument as the next command. If a line is invalid for whatever reason (bad number of arguments as above or unknown command), you can then decide individually what you want to do (just ignore the line or abort processing entirely). So:
std::string line;
while(std::getline(std::cin, line))
{
// parse the string; if line is invalid, appropriate error handling
// (ignoring the line, exiting from loop, ...)
}
if(!std::cin.eof())
{
// some error occurred, print error message!
}
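The "parse the string" step could look roughly like the sketch below. The Command struct and parse_line function are hypothetical names introduced for illustration; the command set (put, delete, deleteall, get) is taken from the question, and a line is treated as invalid when an argument is missing or a surplus token follows.

```cpp
#include <sstream>
#include <string>

// Hypothetical result type; not part of the original code.
struct Command {
    std::string name;   // "put", "delete", "deleteall" or "get"
    std::string key;
    std::string value;  // only filled in for put/delete
    bool valid = false;
};

// Parse one input line; cmd.valid stays false for malformed lines.
Command parse_line(const std::string& line) {
    Command cmd;
    std::istringstream in(line);
    if (!(in >> cmd.name >> cmd.key)) return cmd;  // too few tokens

    const bool needs_value = (cmd.name == "put" || cmd.name == "delete");
    const bool known = needs_value || cmd.name == "get" || cmd.name == "deleteall";
    if (!known) return cmd;                             // unknown command
    if (needs_value && !(in >> cmd.value)) return cmd;  // missing value

    std::string extra;
    if (in >> extra) return cmd;  // surplus argument on the line

    cmd.valid = true;
    return cmd;
}
```

Inside the getline loop, the caller would check cmd.valid and then dispatch on cmd.name, deciding separately whether an invalid line is skipped or aborts processing.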

Alternatives to standard functions of C++ to get speed optimization

Just to clarify: I also think the title is a bit silly. We all know that most built-in functions of the language are really well written and fast (some are even written in assembly). Still, there may be some advice for my situation. I have a small project which demonstrates how a search engine works. In the indexing phase, I have a filter method to filter out unnecessary things from the keywords. Here it is:
bool Indexer::filter(string &keyword)
{
// Remove all characters defined in isGarbage method
keyword.resize(std::remove_if(keyword.begin(), keyword.end(), isGarbage) - keyword.begin());
// Transform all characters to lower case
std::transform(keyword.begin(), keyword.end(), keyword.begin(), ::tolower);
// After filtering, if the keyword is empty or it is contained in stop words list, mark as invalid keyword
if (keyword.size() == 0 || stopwords_.find(keyword) != stopwords_.end())
return false;
return true;
}
At first sight, these functions (all of them are member functions of STL containers or standard functions) are supposed to be fast and should not take much time in the indexing phase. But after profiling with Valgrind, the inclusive cost of this filter is ridiculously high: 33.4%. Three standard functions in this filter account for most of that percentage: std::remove_if takes 6.53%, std::set::find takes 15.07% and std::transform takes 7.71%.
So if there is anything I can do (or change) to reduce the instruction cost of this filter (like parallelizing or something like that), please give me your advice. Thanks in advance.
UPDATE: Thanks for all your suggestions. In brief, what I need to do is:
1) Merge tolower and remove_if into one by writing my own loop.
2) Use unordered_set instead of set for a faster find method.
Thus I've chosen Mark_B's as the right answer.
First, are you certain that optimization and inlining are enabled when you compile?
Assuming that's the case, I would first try writing my own transformer that combines removing garbage and lower-casing into one step to prevent iterating over the keyword that second time.
There's not a lot you can do about the find without using a different container such as unordered_set as suggested in a comment.
Is it possible for your application that doing the filtering really just is a really CPU-intensive part of the operation?
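A minimal sketch of those two suggestions combined, written as a free function rather than the Indexer member from the question. The isGarbage shown here is only a placeholder (the question does not show the real one), and passing the stop words as a parameter is an assumption made to keep the example self-contained.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_set>

// Placeholder predicate: anything that is not alphanumeric counts as garbage.
// The real isGarbage from the question may differ.
bool isGarbage(char c) {
    return !std::isalnum(static_cast<unsigned char>(c));
}

// Removes garbage and lower-cases in one pass, then checks the stop words.
// Returns true when the filtered keyword is non-empty and not a stop word.
bool filter(std::string& keyword, const std::unordered_set<std::string>& stopwords) {
    std::size_t out = 0;  // write position, never ahead of the read position
    for (char c : keyword) {
        if (!isGarbage(c))
            keyword[out++] = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    keyword.resize(out);  // shrinking does not reallocate
    return !keyword.empty() && stopwords.count(keyword) == 0;
}
```

The hash lookup in unordered_set is expected constant time on average, compared with the logarithmic number of string comparisons in std::set; whether that shows up in the profile still needs to be measured.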
If you use a boost filter iterator you can merge the remove_if and transform into one, something like (untested):
keyword.erase(std::transform(boost::make_filter_iterator(!boost::bind(isGarbage, _1), keyword.begin(), keyword.end()),
boost::make_filter_iterator(!boost::bind(isGarbage, _1), keyword.end(), keyword.end()),
keyword.begin(),
::tolower), keyword.end());
This is assuming you want the side effect of modifying the string to still be visible externally, otherwise pass by const reference instead and just use count_if and a predicate to do all in one.
You can build a hierarchical data structure (basically a tree) for the list of stop words that makes "in-place" matching possible, for example if your stop words are SELECT, SELECTION, SELECTED you might build a tree:
|- (other/empty accept)
\- S-E-L-E-C-T- (empty, fail)
|- (other, accept)
|- I-O-N (fail)
\- E-D (fail)
You can traverse a tree structure like that simultaneously whilst transforming and filtering without any modifications to the string itself. In reality you'd want to compact the multi-character runs into a single node in the tree (probably).
You can build such a data structure fairly trivially with something like:
#include <iostream>
#include <map>
#include <memory>
#include <string>
class keywords {
struct node {
node() : end(false) {}
std::map<char, std::unique_ptr<node>> children;
bool end;
} root;
void add(const std::string::const_iterator& stop, const std::string::const_iterator c, node& n) {
if (!n.children[*c])
n.children[*c] = std::unique_ptr<node>(new node);
if (stop == c+1) {
n.children[*c]->end = true;
return;
}
add(stop, c+1, *n.children[*c]);
}
public:
void add(const std::string& str) {
add(str.end(), str.begin(), root);
}
bool match(const std::string& str) const {
const node *current = &root;
std::string::size_type pos = 0;
while(current && pos < str.size()) {
const std::map<char,std::unique_ptr<node>>::const_iterator it = current->children.find(str[pos++]);
current = it != current->children.end() ? it->second.get() : nullptr;
}
if (!current) {
return false;
}
return current->end;
}
};
int main() {
keywords list;
list.add("SELECT");
list.add("SELECTION");
list.add("SELECTED");
std::cout << list.match("TEST") << std::endl;
std::cout << list.match("SELECT") << std::endl;
std::cout << list.match("SELECTOR") << std::endl;
std::cout << list.match("SELECTED") << std::endl;
std::cout << list.match("SELECTION") << std::endl;
}
This worked as you'd hope and gave:
0
1
0
1
1
Which then just needs to have match() modified to call the transformation and filtering functions appropriately e.g.:
const char c = str[pos++];
if (filter(c)) {
const std::map<char,std::unique_ptr<node>>::const_iterator it = current->children.find(transform(c));
}
You can optimise this a bit (compact long single string runs) and make it more generic, but it shows how doing everything in-place in one pass might be achieved and that's the most likely candidate for speeding up the function you showed.
(Benchmark changes of course)
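Putting the pieces together, a one-pass match that filters and lower-cases while walking the tree might be sketched as below. The class name stop_word_trie and the helpers keep and lower are hypothetical stand-ins for the question's isGarbage and tolower steps, and the stop words are assumed to be stored already lower-cased.

```cpp
#include <cctype>
#include <map>
#include <memory>
#include <string>

// Assumed stand-ins for the filtering and transformation steps.
inline bool keep(char c) { return std::isalnum(static_cast<unsigned char>(c)) != 0; }
inline char lower(char c) { return static_cast<char>(std::tolower(static_cast<unsigned char>(c))); }

class stop_word_trie {
    struct node {
        std::map<char, std::unique_ptr<node>> children;
        bool end = false;  // true when a stop word ends at this node
    };
    node root;

public:
    // Stop words are expected to be added in normalised (lower-case) form.
    void add(const std::string& word) {
        node* n = &root;
        for (char c : word) {
            std::unique_ptr<node>& child = n->children[c];
            if (!child) child.reset(new node);
            n = child.get();
        }
        n->end = true;
    }

    // True when the filtered, lower-cased form of raw is a stored stop word.
    // The input string itself is never modified or copied.
    bool match(const std::string& raw) const {
        const node* n = &root;
        for (char c : raw) {
            if (!keep(c)) continue;  // skip garbage in place
            auto it = n->children.find(lower(c));
            if (it == n->children.end()) return false;
            n = it->second.get();
        }
        return n->end;
    }
};
```

This only answers the stop-word question; if the cleaned keyword is needed afterwards for indexing, it still has to be produced somewhere, so the saving depends on how many keywords turn out to be stop words.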
If a call to isGarbage() does not require synchronization, then parallelization should be the first optimization to consider (given of course that filtering one keyword is a big enough task, otherwise parallelization should be done one level higher). Here's how it could be done - in one pass through the original data, multi-threaded using Threading Building Blocks:
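#include <cctype>   // tolower
#include <iostream>
#include <string>
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>
bool isGarbage(char c) {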
return c == 'a';
}
struct RemoveGarbageAndLowerCase {
std::string result;
const std::string& keyword;
RemoveGarbageAndLowerCase(const std::string& keyword_) : keyword(keyword_) {}
RemoveGarbageAndLowerCase(RemoveGarbageAndLowerCase& r, tbb::split) : keyword(r.keyword) {}
void operator()(const tbb::blocked_range<size_t> &r) {
for(size_t i = r.begin(); i != r.end(); ++i) {
if(!isGarbage(keyword[i])) {
result.push_back(tolower(keyword[i]));
}
}
}
void join(RemoveGarbageAndLowerCase &rhs) {
result.insert(result.end(), rhs.result.begin(), rhs.result.end());
}
};
void filter_garbage(std::string &keyword) {
RemoveGarbageAndLowerCase res(keyword);
tbb::parallel_reduce(tbb::blocked_range<size_t>(0, keyword.size()), res);
keyword = res.result;
}
int main() {
std::string keyword = "ThIas_iS:saome-aTYpe_Ofa=MoDElaKEYwoRDastrang";
filter_garbage(keyword);
std::cout << keyword << std::endl;
return 0;
}
Of course, the final code could be improved further by avoiding data copying, but the goal of the sample is to demonstrate that it's an easily threadable problem.
You might make this faster by making a single pass through the string, ignoring the garbage characters. Something like this:
std::string normalizedKeyword;
normalizedKeyword.reserve(keyword.size());
for (auto p = keyword.begin(); p != keyword.end(); ++p)
{
char ch = *p;
if (!isGarbage(ch))
normalizedKeyword.push_back(static_cast<char>(::tolower(static_cast<unsigned char>(ch)))); // std::string has no append(char)
}
// then search for normalizedKeyword in stopwords
This should eliminate the overhead of std::remove_if, although there is a memory allocation and some new overhead of copying characters to normalizedKeyword.
The problem here isn't the standard functions, it's your use of them. You are making multiple passes over your string when you obviously need to be doing only one.
What you need to do probably can't be done with the algorithms straight up, you'll need help from boost or rolling your own.
You should also carefully consider whether resizing the string is actually necessary. Yeah, you might save some space but it's going to cost you in speed. Removing this alone might account for quite a bit of your operation's expense.
Here's a way to combine the garbage removal and lower-casing into a single step. It won't work for multi-byte encoding such as UTF-8, but neither did your original code. I assume 0 and 1 are both garbage values.
bool Indexer::filter(string &keyword)
{
static char replacements[256] = {1}; // initialize with an invalid char
if (replacements[0] == 1)
{
for (int i = 0; i < 256; ++i)
replacements[i] = isGarbage(i) ? 0 : ::tolower(i);
}
string::iterator tail = keyword.begin();
for (string::iterator it = keyword.begin(); it != keyword.end(); ++it)
{
unsigned int index = (unsigned int) *it & 0xff;
if (replacements[index])
*tail++ = replacements[index];
}
keyword.resize(tail - keyword.begin());
    // After filtering, if the keyword is empty or it is contained in stop words list, mark as invalid keyword
    if (keyword.size() == 0 || stopwords_.find(keyword) != stopwords_.end())
        return false;
    return true;
}
The largest part of your timing is the std::set::find so I'd also try std::unordered_set to see if it improves things.
I would implement it with lower level C functions, something like this maybe (not checking this compiles), doing the replacement in place and not resizing the keyword.
Instead of using a set for garbage characters, I'd add a static table of all 256 characters (yeah, it will work for ascii only), with 0 for all characters that are ok, and 1 for those who should be filtered out. something like:
static const char GARBAGE[256] = { 1, 1, 1, 1, 1, ...., 0, 0, 0, 0, 1, 1, ... };
then for each character at offset pos in const char *str you can just check if (GARBAGE[(unsigned char)str[pos]] == 1);
This is more or less what an unordered set does, but with far fewer instructions. stopwords should be an unordered set if it isn't already.
now the filtering function (I'm assuming ascii/utf8 and null terminated strings here):
bool Indexer::filter(char *keyword)
{
char *head = keyword;
char *tail = keyword;
while (*head != '\0') {
//copy non garbage chars from head to tail, lowercasing them while at it
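if (!GARBAGE[(unsigned char)*head]) { // cast avoids a negative index when char is signed
*tail = (char)tolower((unsigned char)*head);
++tail; //we only advance tail if the char was not garbage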
}
//head always advances
++head;
}
*tail = '\0';
// After filtering, if the keyword is empty or it is contained in stop words list, mark as invalid keyword
if (tail == keyword || stopwords_.find(keyword) != stopwords_.end())
return false;
return true;
}

How to implement C++ dictionary data structure without using STL

I have a project that I am doing; the main objective is to load a list of words (and lots of them, 15k+) into a data structure and then do a search on that structure. I did a little research and as far as I can tell a hash table would be best for this (correct me if I am wrong, I looked into tries as well).
Here's the tricky part: I cannot use any STL containers for this project. So as far as I can tell I am going to have to write my own hash table class or find one that pretty much works. I understand how hash tables work on a basic level but I am not sure I know enough to write a whole one by myself.
I looked around Google and I could not find any suitable sample code.
My question is does anyone know how to do this in c++ and/or where I can find some code to start off with. I need 3 basic functions for the table: insert, search, remove.
Things to remember while you think about this:
The Number 1 Concern is SPEED! This needs to be lightning fast, no concern for system resources. From the reading that I have done, a hash table can do better than O(log n). Consider multithreading?
Cannot use STL!
I think, sorted array of strings + binary search should be pretty efficient.
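A rough sketch of that idea without any standard containers, assuming the words are already sorted in strcmp order (for example, sorted once after loading). The function name find_word is hypothetical.

```cpp
#include <cstddef>
#include <cstring>

// Binary search over an array of C strings sorted in ascending strcmp order.
// Returns the index of word, or -1 when it is not present.
long find_word(const char* const* sorted, std::size_t count, const char* word) {
    std::size_t lo = 0;
    std::size_t hi = count;  // search the half-open range [lo, hi)
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        const int cmp = std::strcmp(sorted[mid], word);
        if (cmp == 0) return static_cast<long>(mid);
        if (cmp < 0)
            lo = mid + 1;  // word sorts after sorted[mid]
        else
            hi = mid;      // word sorts before sorted[mid]
    }
    return -1;
}
```

Lookups take O(log n) string comparisons, which is about 14 for 15k words; the trade-off is that insert and remove have to shift elements to keep the array sorted.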
std::unordered_map is not STL
http://www.cs.auckland.ac.nz/software/AlgAnim/hash_tables.html
Not entirely clear on all of the restrictions, but assuming you cannot use anything from std, you could write a simple class like the one below to do the job. We will use an array of buckets to store the data, then use a hash function to turn a string into a number in the range 0...MAX_BUCKETS-1. Each bucket will hold a linked list of strings, so you can retrieve the information again. Typically O(1) insertion and find.
Note that for a more effective solution, you may wish to use a vector rather than a fixed length array as I have gone for. There is also minimal error checking and other improvements, but this should get you started.
NOTE you will need to implement your own string hashing function, you can find plenty of these on the net.
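As one possible starting point for that hash, here is a sketch of the 32-bit FNV-1a function, a widely used simple string hash. It is a swapped-in suggestion, not something the class below prescribes; the names fnv1a and bucket_index are hypothetical.

```cpp
#include <cstdint>

// 32-bit FNV-1a over a null-terminated string.
std::uint32_t fnv1a(const char* word) {
    std::uint32_t h = 2166136261u;               // FNV offset basis
    for (; *word != '\0'; ++word) {
        h ^= static_cast<unsigned char>(*word);  // mix in one byte
        h *= 16777619u;                          // FNV prime; wraps modulo 2^32
    }
    return h;
}

// Map a word to a bucket; bucket_count must be non-zero.
unsigned int bucket_index(const char* word, unsigned int bucket_count) {
    return fnv1a(word) % bucket_count;
}
```

In the class, the body of hash() could then return fnv1a(word) % MAX_BUCKETS in place of the placeholder value.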
#include <cstdio>  // printf
#include <cstring> // memset, strlen, strcpy, strcmp
class dictionary
{
struct data
{
char* word = nullptr;
data* next = nullptr;
~data()
{
delete [] word;
delete next; // recursively frees the rest of the chain
}
};
public:
const unsigned int MAX_BUCKETS;
dictionary(unsigned int maxBuckets = 1024)
: MAX_BUCKETS(maxBuckets)
, words(new data*[MAX_BUCKETS])
{
memset(words, 0, sizeof(data*) * MAX_BUCKETS);
}
~dictionary()
{
for (int i = 0; i < MAX_BUCKETS; ++i)
delete words[i];
delete [] words;
}
void insert(const char* word)
{
const auto hash_index = hash(word);
auto& d = words[hash_index];
if (d == nullptr)
{
d = new data;
copy_string(d, word);
}
else
{
while (d->next != nullptr)
{
d = d->next;
}
d->next = new data;
copy_string(d->next, word);
}
}
void copy_string(data* d, const char* word)
{
const auto word_length = strlen(word)+1;
d->word = new char[word_length];
strcpy(d->word, word);
printf("%s\n", d->word);
}
const char* find(const char* word) const
{
const auto hash_index = hash(word);
auto& d = words[hash_index];
if (d == nullptr)
{
return nullptr;
}
while (d != nullptr)
{
printf("checking %s with %s\n", word, d->word);
if (strcmp(d->word, word) == 0)
return d->word;
d = d->next;
}
return nullptr;
}
private:
unsigned int hash(const char* word) const
{
// :TODO: write your own hash function here
const unsigned int num = 0; // :TODO:
return num % MAX_BUCKETS;
}
data** words;
};
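As one possible way to fill in the hash TODO, here is a sketch of the 32-bit FNV-1a string hash. This is a swapped-in suggestion, not something the answer prescribes, and other string hashes would work as well. The names fnv1a_hash and bucket_index are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// 32-bit FNV-1a: XOR each byte into the state, then multiply by the FNV prime.
inline std::uint32_t fnv1a_hash(const char* word)
{
    std::uint32_t h = 2166136261u;   // FNV offset basis
    for (const unsigned char* p = reinterpret_cast<const unsigned char*>(word); *p != 0; ++p)
    {
        h ^= *p;
        h *= 16777619u;              // FNV prime; unsigned overflow wraps mod 2^32
    }
    return h;
}

// Hypothetical helper: map a word to a bucket index in [0, bucket_count).
inline unsigned int bucket_index(const char* word, unsigned int bucket_count)
{
    assert(bucket_count != 0);       // avoid a modulo by zero
    return fnv1a_hash(word) % bucket_count;
}
```

Inside the class, the body of hash() would then reduce to something like `return fnv1a_hash(word) % MAX_BUCKETS;`. With 15k+ words, a bucket count well above the default 1024 keeps the chains short.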
http://wikipedia-clustering.speedblue.org/trie.php
Seems the above link is down at the moment.
Alternative link:
https://web.archive.org/web/20160426224744/http://wikipedia-clustering.speedblue.org/trie.php
Source Code: https://web.archive.org/web/20160426224744/http://wikipedia-clustering.speedblue.org/download/libTrie-0.1.tgz

Sorting a file with 55K rows and varying Columns

I want to find a programmatic solution using C++.
I have 900 files, each 27 MB in size (just to give an idea of the scale).
Each file has 55K rows and a varying number of columns, but the header line names the columns.
I want to sort the rows by the value in one column.
I wrote a sorting algorithm for this (definitely a newbie attempt, you may say).
It works for a small number of rows, but fails for larger inputs.
Here is the code:
First, the basic functions I defined to use inside the main code:
int getNumberOfColumns(const string& aline)
{
int ncols=0;
istringstream ss(aline);
string s1;
while(ss>>s1) ncols++;
return ncols;
}
vector<string> getWordsFromSentence(const string& aline)
{
vector<string>words;
istringstream ss(aline);
string tstr;
while(ss>>tstr) words.push_back(tstr);
return words;
}
bool findColumnName(vector<string> vs, const string& colName)
{
vector<string>::iterator it = find(vs.begin(), vs.end(), colName);
if ( it != vs.end())
return true;
else return false;
}
int getIndexForColumnName(vector<string> vs, const string& colName)
{
if ( !findColumnName(vs,colName) ) return -1;
else {
vector<string>::iterator it = find(vs.begin(), vs.end(), colName);
return it - vs.begin();
}
}
// I like recursive functions, so I tried to write one here.
// This worked for small inputs, say 20 rows, but for 55K rows it core dumps.
void sort2D(vector<string>vn, vector<string> &srt, int columnIndex)
{
vector<double> pVals;
for ( int i = 0; i < vn.size(); i++) {
vector<string>meancols = getWordsFromSentence(vn[i]);
pVals.push_back(stringToDouble(meancols[columnIndex]));
}
srt.push_back(vn[max_element(pVals.begin(), pVals.end())-pVals.begin()]);
if (vn.size() > 1 ) {
vn.erase(vn.begin()+(max_element(pVals.begin(), pVals.end())-pVals.begin()) );
vector<string> vn2 = vn;
//cout<<srt[srt.size() -1 ]<<endl;
sort2D(vn2 , srt, columnIndex);
}
}
Now the main code:
for ( int i = 0; i < TissueNames.size() -1; i++)
{
for ( int j = i+1; j < TissueNames.size(); j++)
{
//string fname = path+"/gse7307_Female_rma"+TissueNames[i]+"_"+TissueNames[j]+".txt";
//string fname2 = sortpath2+"/gse7307_Female_rma"+TissueNames[i]+"_"+TissueNames[j]+"Sorted.txt";
string fname = path+"/gse7307_Male_rma"+TissueNames[i]+"_"+TissueNames[j]+".txt";
string fname2 = sortpath2+"/gse7307_Male_rma"+TissueNames[i]+"_"+TissueNames[j]+"4Columns.txt";
vector<string>AllLinesInFile;
BioInputStream fin(fname);
string aline;
getline(fin,aline);
replace (aline.begin(), aline.end(), '"',' ');
string headerline = aline;
vector<string> header = getWordsFromSentence(aline);
int pindex = getIndexForColumnName(header,"p-raw");
int xcindex = getIndexForColumnName(header,"xC");
int xeindex = getIndexForColumnName(header,"xE");
int prbindex = getIndexForColumnName(header,"X");
string newheaderline = "X\txC\txE\tp-raw";
BioOutputStream fsrt(fname2);
fsrt<<newheaderline<<endl;
int newpindex=3;
while ( getline(fin, aline) ){
replace (aline.begin(), aline.end(), '"',' ');
istringstream ss2(aline);
string tstr;
ss2>>tstr;
tstr = ss2.str().substr(tstr.length()+1);
vector<string> words = getWordsFromSentence(tstr);
string values = words[prbindex]+"\t"+words[xcindex]+"\t"+words[xeindex]+"\t"+words[pindex];
AllLinesInFile.push_back(values);
}
vector<string>SortedLines;
sort2D(AllLinesInFile, SortedLines,newpindex);
for ( int si = 0; si < SortedLines.size(); si++)
fsrt<<SortedLines[si]<<endl;
cout<<"["<<i<<","<<j<<"] = "<<SortedLines.size()<<endl;
}
}
Can someone suggest a better way of doing this?
Why is it failing for larger inputs?
The primary function of interest for this question is the sort2D function.
Thanks for your time and patience.
prasad.
I'm not sure why your code is crashing, but recursion in that case is only going to make the code less readable. I doubt it's a stack overflow, however, because you're not using much stack space in each call.
C++ already has std::sort, why not use that instead? You could do it like this:
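// functor to compare 2 strings by the numeric value in one column
// (std::binary_function needs <functional>; it was deprecated in C++11 and removed in C++17, so drop the base class with newer compilers)
class CompareStringByValue : public std::binary_function<string, string, bool>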
{
public:
CompareStringByValue(int columnIndex) : idx_(columnIndex) {}
bool operator()(const string& s1, const string& s2) const
{
double val1 = stringToDouble(getWordsFromSentence(s1)[idx_]);
double val2 = stringToDouble(getWordsFromSentence(s2)[idx_]);
return val1 < val2;
}
private:
int idx_;
};
To then sort your lines you would call
std::sort(vn.begin(), vn.end(), CompareStringByValue(columnIndex));
Now, there is one problem. This will be slow because stringToDouble and getWordsFromSentence are called again on the same strings for every comparison. You would probably want to generate a separate vector that holds the precalculated value of each string, and then have CompareStringByValue just use that vector as a lookup table.
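The precalculation idea can be sketched as follows: parse the sort column once per line, sort (value, position) pairs, and then rebuild the lines in that order. This is a minimal sketch under assumptions; columnValue and sortByColumn are hypothetical names, and columnValue stands in for the question's stringToDouble and getWordsFromSentence, assuming whitespace-separated columns.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: parse the whitespace-separated field at columnIndex as a double.
// Returns 0.0 if the column is missing or is not a number.
inline double columnValue(const std::string& line, std::size_t columnIndex)
{
    std::istringstream ss(line);
    std::string field;
    for (std::size_t i = 0; i <= columnIndex; ++i)
        if (!(ss >> field))
            return 0.0;
    std::istringstream fs(field);
    double value = 0.0;
    return (fs >> value) ? value : 0.0;
}

// Sorts lines in ascending order of the value in columnIndex, parsing each line only once.
inline void sortByColumn(std::vector<std::string>& lines, std::size_t columnIndex)
{
    std::vector<std::pair<double, std::size_t> > keyed;  // (value, original position)
    keyed.reserve(lines.size());
    for (std::size_t i = 0; i < lines.size(); ++i)
        keyed.push_back(std::make_pair(columnValue(lines[i], columnIndex), i));

    // Pairs compare by value first and by position second, so ties keep their original order.
    std::sort(keyed.begin(), keyed.end());

    std::vector<std::string> sorted;
    sorted.reserve(lines.size());
    for (std::size_t i = 0; i < keyed.size(); ++i)
        sorted.push_back(lines[keyed[i].second]);
    assert(sorted.size() == lines.size());               // every line is emitted exactly once
    lines.swap(sorted);
}
```

Note that this sorts ascending, like the comparator above, whereas the original sort2D emits the largest value first; iterating the result in reverse gives that order.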
Another way you can do this is insert the strings into a std::multimap<double, std::string>. Just insert the entries as (value, str) and then read them out line-by-line. This is simpler but slower (though has the same big-O complexity).
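The multimap variant might look like the sketch below. The names parseColumn and sortViaMultimap are made up for illustration, and parseColumn again assumes whitespace-separated columns in place of the question's own helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: parse the whitespace-separated field at columnIndex as a double.
// Returns 0.0 if the column is missing or is not a number.
inline double parseColumn(const std::string& line, std::size_t columnIndex)
{
    std::istringstream ss(line);
    std::string field;
    for (std::size_t i = 0; i <= columnIndex; ++i)
        if (!(ss >> field))
            return 0.0;
    std::istringstream fs(field);
    double value = 0.0;
    return (fs >> value) ? value : 0.0;
}

// Returns the lines ordered by ascending value of the given column.
inline std::vector<std::string> sortViaMultimap(const std::vector<std::string>& lines,
                                                std::size_t columnIndex)
{
    // The multimap keeps its entries ordered by key; duplicate keys are allowed.
    std::multimap<double, std::string> byValue;
    for (std::size_t i = 0; i < lines.size(); ++i)
        byValue.insert(std::make_pair(parseColumn(lines[i], columnIndex), lines[i]));

    std::vector<std::string> sorted;
    sorted.reserve(lines.size());
    for (std::multimap<double, std::string>::const_iterator it = byValue.begin();
         it != byValue.end(); ++it)
        sorted.push_back(it->second);
    assert(sorted.size() == lines.size());  // nothing is dropped, even with equal keys
    return sorted;
}
```

Each insert allocates a tree node and copies the line, which is where the extra cost compared with sorting a vector in place comes from.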
EDIT: Cleaned up some incorrect code and derived from binary_function.
You could try a method that doesn't involve recursion. If your program crashes in the sort2D function with large inputs, then you're probably overflowing the stack (a danger of using recursion with a large number of nested calls). Try another sorting method, maybe one using a loop.
sort2D crashes because you keep allocating an array of strings to sort and then you pass it by value, in effect using O(2*N^2) memory. If you really want to keep your recursive function, simply pass vn by reference and don't bother with vn2. And if you don't want to modify the original vn, move the body of sort2D into another function (say, sort2Drecursive) and call that from sort2D.
You might want to take another look at sort2D in general, since you are doing O(N^2) work for something that should take O(N+N*log(N)).
The problem is less your code than the tool you chose for the job. This is purely a text processing problem, so choose a tool good at that. In this case on Unix the best tool for the job is Bash and the GNU coreutils. On Windows you can use PowerShell, Python or Ruby. Python and Ruby will work on any Unix-flavoured machine too, but roughly all Unix machines have Bash and the coreutils installed.
Let $FILES hold the list of files to process, delimited by whitespace. Here's the code for Bash:
for FILE in $FILES; do
echo "Processing file $FILE ..."
# --lines=+2 starts output at the second line, i.e. it skips the header row
tail --lines=+2 "$FILE" | sort > "$FILE.tmp"
mv "$FILE.tmp" "$FILE"
done