Background:
I was asked this question today in an online practice interview and had a hard time figuring out a custom comparator for the sort. Here is the question:
Question:
Implement a document scanning function wordCountEngine, which receives a string document and returns a list of all unique words in it and their number of occurrences, sorted by the number of occurrences in descending order. If two or more words have the same count, they should be sorted according to their order in the original sentence. Assume that all letters are in the English alphabet. Your function should be case-insensitive, so for instance, the words “Perfect” and “perfect” should be considered the same word.
The engine should strip out punctuation (even in the middle of a word) and use whitespaces to separate words.
Analyze the time and space complexities of your solution. Try to optimize for time while keeping a polynomial space complexity.
Examples:
input: document = "Practice makes perfect. you'll only
get Perfect by practice. just practice!"
output: [ ["practice", "3"], ["perfect", "2"],
["makes", "1"], ["youll", "1"], ["only", "1"],
["get", "1"], ["by", "1"], ["just", "1"] ]
My idea:
The first thing I wanted to do was strip the punctuation, convert everything to lower case, and split the result into a vector of strings. Then I used an unordered_map container to store each string and a count of its occurrences. Where I got stuck was creating a custom comparator that makes sure that, when two strings have the same count, they are sorted by the order in which they first appear in the given string.
Code:
#include <iostream>
#include <string>
#include <vector>
#include <unordered_map>
#include <sstream>
#include <iterator>
#include <numeric>
#include <algorithm>
using namespace std;
struct cmp
{
bool operator()(std::string& word1, std::string& word2)
{
}
};
vector<vector<string>> wordCountEngine( const string& document )
{
// your code goes here
// Step 1
auto doc = document;
std::string str;
remove_copy_if(doc.begin(), doc.end(), std::back_inserter(str),
std::ptr_fun<int, int>(&std::ispunct));
for(int i = 0; i < str.size(); ++i)
str[i] = tolower(str[i]);
std::stringstream ss(str);
istream_iterator<std::string> begin(ss);
istream_iterator<std::string> end;
std::vector<std::string> vec(begin, end);
// Step 2
std::unordered_map<std::string, int> m;
for(auto word : vec)
m[word]++;
// Step 3
std::vector<std::vector<std::string>> result;
for(auto it : m)
{
result.push_back({it.first, std::to_string(it.second)});
}
return result;
}
int main() {
std::string document = "Practice makes perfect. you'll only get Perfect by practice. just practice!";
auto result = wordCountEngine(document);
for(int i = 0; i < result.size(); ++i)
{
for(int j = 0; j < result[0].size(); ++j)
{
std::cout << result[i][j] << " ";
}
std::cout << "\n";
}
return 0;
}
If anyone can help me with learning how to build a custom comparator for this code I would really appreciate it.
You could use a std::vector<std::pair<std::string, int>>, with each pair representing one word and its number of occurrences in the sequence. Because words are appended to the vector in the order they first appear, the original order is available for breaking ties between words with the same count. Finally, sort by occurrences using a stable sort, which keeps tied words in that order.
#include <vector>
#include <algorithm>
#include <cctype>
#include <iostream>
#include <string>
#include <sstream>
#include <utility>
std::vector<std::vector<std::string>> wordCountEngine(const std::string& document)
{
std::vector<std::pair<std::string, int>> words;
std::istringstream ss(document);
std::string word;
//Loop through words in sequence
while (getline(ss, word, ' '))
{
//Convert to lowercase
std::transform(word.begin(), word.end(), word.begin(), tolower);
//Remove punctuation characters
auto it = std::remove_if(word.begin(), word.end(), [](char c) { return !isalpha(c); });
word.erase(it, word.end());
//Find this word in the result vector
auto pos = std::find_if(words.begin(), words.end(),
[&word](const std::pair<std::string, int>& p) { return p.first == word; });
if (pos == words.end()) {
words.push_back({ word, 1 }); //Doesn't occur -> add it
}
else {
pos->second++; //Increment count
}
}
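//Sort vector by word occurrences; std::stable_sort keeps tied words in their original order (std::sort does not guarantee that)
std::stable_sort(words.begin(), words.end(),
[](const std::pair<std::string, int>& p1, const std::pair<std::string, int>& p2) { return p1.second > p2.second; });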
//Convert to vector<vector<string>>
std::vector<std::vector<std::string>> result;
result.reserve(words.size());
for (auto& p : words)
{
std::vector<std::string> v = { p.first, std::to_string(p.second) };
result.push_back(v);
}
return result;
}
int main()
{
std::string document = "Practice makes perfect. you'll only get Perfect by practice. just practice!";
auto result = wordCountEngine(document);
for (auto& word : result)
{
std::cout << word[0] << ", " << word[1] << std::endl;
}
return 0;
}
Output:
practice, 3
perfect, 2
makes, 1
youll, 1
only, 1
get, 1
by, 1
just, 1
In step 2, try this:
std::vector<std::pair<std::pair<std::string, int>, int>> m;
Here, the inner pair stores the string and the index of its first occurrence, and the vector stores that pair together with the count of its occurrences. Write the logic to sort by count first and then, if the counts are the same, by the position of the first occurrence.
bool sort_vector(const std::pair<std::pair<std::string,int>,int> &a, const std::pair<std::pair<std::string,int>,int> &b)
{
    if(a.second==b.second)
    {
        // Same number of occurrences: the string that appeared first in the document comes first
        return a.first.second<b.first.second;
    }
    // Otherwise the string with the higher number of occurrences comes first
    return a.second>b.second;
}
You still have to write the logic that records the number of occurrences and the index of the first occurrence of each word in the string.
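As a rough sketch of how the counting, the first-occurrence index, and the comparator could fit together: the WordInfo struct and the slot map below are hypothetical names introduced only for this illustration, and the sketch assumes that words are separated by whitespace and that every non-letter character is dropped.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper record: one entry per unique word.
struct WordInfo {
    std::string word;
    int count;          // number of occurrences
    std::size_t first;  // position of the first occurrence in the document
};

std::vector<std::vector<std::string>> wordCountEngine(const std::string& document)
{
    std::vector<WordInfo> infos;
    std::unordered_map<std::string, std::size_t> slot;  // word -> index into infos

    std::istringstream iss(document);
    std::string token;
    std::size_t position = 0;
    while (iss >> token) {
        // Keep letters only and lowercase them
        std::string word;
        for (unsigned char c : token)
            if (std::isalpha(c)) word += static_cast<char>(std::tolower(c));
        if (word.empty()) continue;  // token was pure punctuation

        auto it = slot.find(word);
        if (it == slot.end()) {
            slot.emplace(word, infos.size());
            infos.push_back({word, 1, position});
        } else {
            ++infos[it->second].count;
        }
        ++position;
    }

    // The custom comparator: higher count first, earlier first occurrence on ties
    std::sort(infos.begin(), infos.end(), [](const WordInfo& a, const WordInfo& b) {
        if (a.count != b.count) return a.count > b.count;
        return a.first < b.first;
    });

    std::vector<std::vector<std::string>> result;
    result.reserve(infos.size());
    for (const auto& info : infos)
        result.push_back({info.word, std::to_string(info.count)});
    return result;
}
```

Because the comparator breaks ties explicitly on the stored position, a plain std::sort is enough here. Counting takes expected linear time in the length of the document, and the sort adds O(k log k) for k unique words.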
Related
I'm currently trying to make a very fast anagram solver, and right now it's bottlenecked by the creation of the permutations. Is there another way to structure the whole program, or a way to optimize the permutation creation?
Here's my code:
#include <string>
#include <vector>
#include <algorithm>
#include <cctype>
#include <ctime>
#include <iostream>
#include <fstream>
#include <unordered_set>
#include <boost/asio/thread_pool.hpp>
#include <boost/asio/post.hpp>
void get_permutations(std::string s, std::vector<std::string> &permutations)
{
std::sort(s.begin(), s.end());
do
{
permutations.push_back(s);
} while (std::next_permutation(s.begin(), s.end()));
}
void load_file(std::unordered_set<std::string> &dictionary, std::string filename)
{
std::ifstream words(filename);
std::string element;
while (words >> element)
{
std::transform(element.begin(), element.end(), element.begin(), ::tolower);
dictionary.insert(element);
}
}
void print_valid(const std::unordered_set<std::string>& dictionary, const std::vector<std::string>::const_iterator start, const std::vector<std::string>::const_iterator stop)
{
for (auto iter = start; iter != stop; iter++)
{
if (dictionary.contains(*iter) == true)
{
std::cout << *iter << "\n";
}
}
}
int main()
{
const std::string s = "asdfghjklq";
std::vector<std::string> permutations;
boost::asio::thread_pool pool(2);
std::cout << "Loading english dictionary\n";
std::unordered_set<std::string> dictionary;
load_file(dictionary, "words");
std::cout << "Done\n";
//std::cout << "Enter the anagram: ";
//getline(std::cin, s);
clock_t start = clock();
get_permutations(s, permutations);
//std::cout << permutations.size() << std::endl;
std::cout << "finished permutations\n";
if (permutations.size() > 500000)
{
std::cout << "making new\n";
for (size_t pos = 0; pos < permutations.size(); pos += (permutations.size() / 3))
{
boost::asio::post(pool, [&dictionary, &permutations, pos] { print_valid(dictionary, (permutations.begin() + pos), (permutations.begin() + pos + (permutations.size() /3) ) ); });
}
pool.join();
}
else
{
print_valid(dictionary, permutations.begin(), permutations.end());
}
clock_t finish = clock();
double time_elapsed = (finish - start) / static_cast<double>(CLOCKS_PER_SEC);
std::cout << time_elapsed << "\n";
std::cout << permutations.size() << std::endl;
return 0;
}
The creation of permutations is in get_permutations.
The thread pooling was something to test for very large sets of permutations.
Think about how you would go about this by hand - how do you check if two words are anagrams of each other?
e.g.: banana <-> aaannb
How would you solve this on a piece of paper? Would you create all 720 permutations and check if any one matches? Or is there an easier, more intuitive way?
So what makes a word an anagram of another word, i.e. what condition needs to be satisfied?
It's all about letter counts. If both words contain an equal amount of all letters, they're anagrams of each other.
e.g.:
banana -> 3x a, 2x n, 1x b
aaannb -> 3x a, 2x n, 1x b
same letter counts so they must be anagrams!
So armed with this knowledge can you construct an algorithm that doesn't require iterating all possible permutations?
Solution
I'd only recommend reading this once you've tried to come up with your own optimized algorithm.
You just need to build a lookup-table of letter-counts to dictionary words, e.g.:
1x a, 1x n -> ["an"]
3x a, 1x b, 2x n -> ["banana", "nanaba"]
1x a, 1x p, 1x r, 1x t -> ["part", "trap"]
... etc ...
then you can decompose your search word as well into letter counts, e.g. banana -> 3x a, 1x b, 2x n and search for the decomposition in your lookup table.
The result will be the list of words from your dictionary you can build with the given collection of letters - aka all possible anagrams for the given string.
Assuming some kind of structure named letter_counts that contains the letter composition, the algorithm could look like:
std::vector<std::string> find_anagrams(std::vector<std::string> const& dictionary, std::string const& wordToCheck) {
// build a lookup map for letter composition -> word
std::unordered_map<letter_counts, std::vector<std::string>> compositionMap;
for(auto& str : dictionary)
compositionMap[letter_counts{str}].push_back(str);
// get all words that are anagrams of the given one
auto it = compositionMap.find(letter_counts{wordToCheck});
// no matches in dictionary
if(it == compositionMap.end())
return {};
// list of all anagrams
auto result = it->second;
// remove wordToCheck from result if it is present
result.erase(std::remove_if(result.begin(), result.end(), [&wordToCheck](std::string const& str) { return str == wordToCheck; }), result.end());
return result;
}
This runs in O(n) time and uses O(n) space, with n being the number of words in the dictionary (treating the word length as bounded).
(Each lookup is amortized O(1) if you build compositionMap once up front and don't count its construction as part of the algorithm.)
Compare that to a permutation-based approach, which has O(k!) time complexity for a word of k letters (or, as I like to call it, O(scary)).
Here's a full code example that only deals with letters a-z, but you can easily modify letter_counts to make it work with other characters as well:
#include <string_view>
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <functional>
#include <vector>
#include <string>
#include <unordered_map>
#include <iostream>
struct letter_counts {
static const int num_letters = 26;
int counts[num_letters];
explicit letter_counts(std::string_view str) : counts{0} {
for(char c : str) {
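// std::tolower expects a value representable as unsigned char
c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));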
if(c >= 'a' && c <= 'z')
counts[c - 'a']++;
}
}
};
bool operator==(letter_counts const& lhs, letter_counts const& rhs) {
for(int i = 0; i < letter_counts::num_letters; i++) {
if(lhs.counts[i] != rhs.counts[i]) return false;
}
return true;
}
template <class T>
inline void hash_combine(std::size_t& seed, const T& v)
{
std::hash<T> hasher;
seed ^= hasher(v) + 0x9e3779b9 + (seed<<6) + (seed>>2);
}
namespace std {
template<>
struct hash<letter_counts> {
size_t operator()(const letter_counts& letterCounts) const
{
size_t result = 0;
auto hasher = std::hash<int>{};
for(int i : letterCounts.counts)
hash_combine(result, hasher(i));
return result;
}
};
}
std::vector<std::string> find_anagrams(std::vector<std::string> const& dictionary, std::string const& wordToCheck) {
// build a lookup map for letter composition -> word
std::unordered_map<letter_counts, std::vector<std::string>> compositionMap;
for(auto& str : dictionary)
compositionMap[letter_counts{str}].push_back(str);
// get all words that are anagrams of the given one
auto it = compositionMap.find(letter_counts{wordToCheck});
// no matches in dictionary
if(it == compositionMap.end())
return {};
// list of all anagrams
auto result = it->second;
// remove wordToCheck from result if it is present
result.erase(std::remove_if(result.begin(), result.end(), [&wordToCheck](std::string const& str) { return str == wordToCheck; }), result.end());
return result;
}
int main() {
std::vector<std::string> dict = {
"banana",
"nanaba",
"foobar",
"bazinga"
};
std::string word = "aaannb";
for(auto& str : find_anagrams(dict, word)) {
std::cout << str << std::endl;
}
}
The permutation method you have is way too slow, since the number of permutations of a string of n distinct characters is n!, which grows faster than any exponential. Try hashing with an equality predicate instead: base the hash on the sorted string, and have the equality predicate test only whether the sorted versions of the two strings are equal. You can use boost::unordered_map with a custom hash function and collect the words that fit the anagram under the matching key.
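A minimal sketch of that idea follows. It swaps in std::unordered_map with the sorted, lowercased string itself as the key, so the standard string hash and equality can be reused instead of a custom hash and predicate. The names sorted_key, AnagramIndex, build_index and lookup are made up for this illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: lowercase the word and sort its letters.
// All anagrams of a word produce the same key.
std::string sorted_key(std::string word)
{
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    std::sort(word.begin(), word.end());
    return word;
}

using AnagramIndex = std::unordered_map<std::string, std::vector<std::string>>;

// Build the index once: sorted letters -> all dictionary words made of them
AnagramIndex build_index(const std::vector<std::string>& dictionary)
{
    AnagramIndex index;
    for (const auto& word : dictionary)
        index[sorted_key(word)].push_back(word);
    return index;
}

// Each query is then a single hash lookup instead of a walk over all permutations
std::vector<std::string> lookup(const AnagramIndex& index, const std::string& letters)
{
    auto it = index.find(sorted_key(letters));
    return it == index.end() ? std::vector<std::string>{} : it->second;
}
```

Building the index costs one sort per dictionary word, after which a query only sorts the input letters and performs one lookup.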
Note that the number of combinations tends to become very large very quickly. Two words are anagrams if sorting the characters of both words alphabetically produces the same string. Based on that fact, I made the following example that puts a dictionary into a multimap in which all anagrams of a word can be found quickly. It does this by using the alphabetically sorted input string as the key into the map.
Live demo : https://onlinegdb.com/fXUVZruwq
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <locale>
#include <map>
#include <string>
#include <vector>
#include <set>
// create a class to hold anagram information
class anagram_dictionary_t
{
public:
// create a dictionary based on an input list of words.
template<std::size_t N>
explicit anagram_dictionary_t(const std::string (&words)[N])
{
for (std::string word : words)
{
auto key = make_key(word);
std::string lower{ word };
to_lower(lower);
m_anagrams.insert({ key, lower});
}
}
// find all the words that match the anagram
auto find_words(const std::string& anagram)
{
// get the unique key for input word
// this is done by sorting all the characters in the input word alphabetically
auto key = make_key(anagram);
// lookup all the words with the same key in the dictionary
auto range = m_anagrams.equal_range(key);
// create a set of found words
std::set<std::string> words;
for (auto it = range.first; it != range.second; ++it)
{
words.insert(it->second);
}
// return the words
return words;
}
// function to check if two words are an anagram
bool is_anagram(const std::string& anagram, const std::string& word)
{
auto words = find_words(anagram);
return (words.find(word) != words.end());
}
private:
// make a unique key out of an input word
// all anagrams should map to the same key value
static std::string make_key(const std::string& word)
{
std::string key{ word };
to_lower(key);
// two words are anagrams if they sort to the same key
std::sort(key.begin(), key.end());
return key;
}
static void to_lower(std::string& word)
{
for (char& c : word)
{
c = std::tolower(c, std::locale());
}
}
std::multimap<std::string, std::string> m_anagrams;
};
int main()
{
anagram_dictionary_t anagram_dictionary{ {"Apple", "Apricot", "Avocado", "Banana", "Bilberry", "Blackberry", "Blueberry" } };
std::string anagram{ "aaannb"};
auto words = anagram_dictionary.find_words(anagram);
std::cout << "input word = " << anagram << "\n found words : ";
for (const auto& word : words)
{
std::cout << word << "\n";
}
return 0;
}
Here is my approach where I have tried to split the string into words and then move forward but this is not working.
For instance, the input is: hey hi Mark hi mark
Then the output should be:
hey-1
hi-2
Mark-1
hi-2
mark-1
#include <iostream>
#include <string>
#include <vector>
using namespace std;
int main()
{
vector<vector<string> > strs;
string str;
cout<<"Enter your strings"<<endl;
getline(cin, str);
int len=str.length();
int j=0;
string s="";
for(int i=0; i<len; i++){
s+=str[i];
if(str[i+1]==' ' || i+1==len){
strs[0][j]=s;
s="";
j++;
i++;
}
}
strs[0][j]="NULL";
int freq;
vector<int> frequency;
for(int n=0; strs[0][n]!="NULL" ;n++){
freq=1;
for(int m=0; strs[0][m]!="NULL"; m++){
if(strs[0][n]==strs[0][m]){
freq++;
}
frequency.push_back(freq);
}
}
for(int x=0; strs[0][x]!="NULL"; x++){
cout<<strs[0][x]<<" - "<<frequency[x]<<endl;
}
return 0;
}
In your code, you index into strs[0][j] while strs is still empty, which is undefined behavior and is what raises the segmentation fault. To solve your problem, I came up with the solution below.
#include <iostream>
#include <string>
#include <map>
/* getWordFrequency: returns a std::map<std::string, int> of word frequencies
Param1: Input string
Param2: Delimiter, defaults to ' ' (a single space).
*/
std::map<std::string, int> getWordFrequency(const char *input_string, char c = ' ')
{
// Container to store output result
std::map<std::string, int> result;
// Iteration loop
do{
// Iteration pointer to iterate Character by Character
const char *begin = input_string;
// Advance until the delimiter or the terminating null character is reached
while(*input_string != c && *input_string){
// Jump to next character
input_string++;
}
// Iterator for output result container
std::map<std::string, int>::iterator finder = result.find(std::string(begin, input_string));
// Find element using iterator
if(finder != result.end()){
// Element already present in the result map, so increment its frequency by one
finder->second += 1;
} else {
// If no element found then insert new word with frequency 1
result.insert(std::pair<std::string, int>(std::string(begin, input_string),1));
}
} while (0 != *input_string++); // Continue till end of string
return result;
}
int main()
{
// Your string
std::string input_string = "hey hi Mark hi mark";
// Container to catch result
std::map<std::string, int> frequency = getWordFrequency(input_string.c_str());
// Printing frequency of each word present in string
for (auto element : frequency){
std::cout << element.first << "-" << element.second << std::endl;
}
return 0;
}
So, I think your approach using 2 std::vectors is unfortunately wrong. It looks like the difference between char and std::string is not yet fully clear to you.
That is worth reading up on.
There is a more or less standard approach for counting things in a container, whether that is a string or a container in general.
We can use an associative container like a std::map or a std::unordered_map. And here we associate a "key", in this case the "word" to count, with a value, in this case the count of the specific word.
And luckily the maps have a very nice index operator[]. This will look for the given key and, if found, return a reference to the value. If not found, then it will create a new entry with the key and return a reference to the new entry. So, in both cases, we will get a reference to the value used for counting. And then we can simply write:
std::map<std::string, int> counter{};
counter[word]++;
And that's it. More is not necessary. Please see:
#include <iostream>
#include <string>
#include <sstream>
#include <unordered_map>
int main() {
// Our test String
std::string text{"hey hi Mark hi mark"};
// Here, we will store the result of the counting
std::unordered_map<std::string, unsigned int> counter;
// Now count all words. This one line does all the counting
for (std::istringstream iss{text}; iss >> text; counter[text]++);
// Show result to user
for (const auto& [word, count] : counter) std::cout << word << '-' << count << ' ';
}
It also seems that splitting a string is somehow difficult for you. Here, too, there are many possible solutions.
One of the more sophisticated and more advanced solution is to use the std::sregex_token_iterator. With that you can easily iterate over patterns (described by a std::regex) in a string.
The final code will look nearly the same, but the result will be better, since for example punctuation can be excluded.
Example:
#include <iostream>
#include <string>
#include <unordered_map>
#include <regex>
#include <iterator>
using Iter = std::sregex_token_iterator;
const std::regex re{R"(\w+)"};
int main() {
// Our test String
std::string text{"hey hi Mark, hi mark."};
// Here, we will store the result of the counting
std::unordered_map<std::string, unsigned int> counter;
// Now count all words. This one line does all the counting
for (Iter word(text.begin(), text.end(), re); word != Iter(); counter[*word++]++);
// Show result to user
for (const auto& [word, count] : counter) std::cout << word << '-' << count << ' ';
}
I'm working on a program that checks whether or not a particular word is an anagram using std::count; however, I don't think my function logic is correct and I cannot seem to figure it out.
Assume there are the following words in the file:
Evil
Vile
Veil
Live
My code is as follows:
#include <iostream>
#include <vector>
#include <fstream>
#include <map>
using namespace std;
struct Compare {
std::string str;
Compare(const std::string& str) : str(str) {}
};
bool operator==(const std::pair<int, std::string>&p, const Compare& c) {
return c.str == p.second;
}
bool operator==(const Compare& c, const std::pair<int, std::string>&p) {
return c.str == p.second;
}
std::vector<std::string> readInput(ifstream& file)
{
std::vector<std::string> temp;
string word;
while (file >> word)
{
temp.push_back(word);
}
std::sort(temp.begin(), temp.end());
return temp;
}
int main(int argc, char *argv[]) {
string file = "testing.txt";
ifstream ss(file.c_str());
if(!ss.is_open())
{
cerr << "Cannot open the text file";
}
std::vector<std::string> words = readInput(ss);
std::map<int, std::string> wordsMap;
//std::map<std::string value, int key> values;
for(unsigned i=0; (i < words.size()); i++)
{
wordsMap[i] = words[i];
}
int count = std::count(wordsMap.begin(), wordsMap.end(), Compare("Evil"));
cout << count << endl;
}
I'm pretty sure it's just a case of my logic is wrong in the functions. I hope someone can help :)
The simplest approach would be
to check as follows (pseudo code):
bool isAnagram(string s, string t) {return sort(s) == sort(t); }
So use something like the following; there is no need for std::map:
struct Compare {
    std::string str;
    Compare(const std::string& x) : str(x) {
        // Normalize the case first, then sort, in the same order as operator()
        std::transform(str.begin(), str.end(), str.begin(), ::toupper);
        std::sort(str.begin(), str.end());
    }
    bool operator()(const std::string& t) const
    {
        std::string s = t;
        std::transform(s.begin(), s.end(), s.begin(), ::toupper);
        std::sort(s.begin(), s.end());
        return s == str;
    }
};
And then
int count = std::count_if(words.begin(), words.end(), Compare("Evil"));
This is not the most efficient algorithm, but a quick change to your program that would work could be:
bool operator==(const std::pair<int, std::string>&p, const Compare& c) {
std::string a = c.str;
std::transform(a.begin(), a.end(), a.begin(), ::tolower);
std::sort(a.begin(), a.end());
std::string b = p.second;
std::transform(b.begin(), b.end(), b.begin(), ::tolower);
std::sort(b.begin(), b.end());
return a == b;
}
EDIT: It seems in your present code, you are checking whether the strings are exactly equal to each other (not anagrams).
INSTEAD:
For each word, make an array of 26 elements, each element corresponding to a letter of the alphabet. Parse each word character by character, and increase the count of the particular character in the respective array.
For example for evil, the array would be:
0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0. // It has 1's for letters e,v,i and l
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z
You make this array for each word that you have. In your case, all the words will have the same array. You then compare these arrays element-wise and proceed accordingly.
Now you just need to see which words have the same corresponding array.
If you want to compare all the N words pair-wise, you can do so using two nested loops in O(N^2) complexity.
The complexity for comparing one pair is O(1).
Complexity of creating the arrays = O(L) where L is the length of the string.
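The array idea above can be sketched as follows. This is only an illustration under the assumption that the letters a-z are all that matters (other characters are ignored); LetterCounts, count_letters and is_anagram are names chosen for this sketch.

```cpp
#include <array>
#include <cctype>
#include <string>

// One counter per letter of the alphabet, index 0 = 'a' ... index 25 = 'z'
using LetterCounts = std::array<int, 26>;

// O(L) for a word of length L
LetterCounts count_letters(const std::string& word)
{
    LetterCounts counts{};  // value-initialized: all zeros
    for (unsigned char c : word) {
        const int lower = std::tolower(c);
        if (lower >= 'a' && lower <= 'z')
            ++counts[lower - 'a'];
    }
    return counts;
}

// Comparing two arrays touches exactly 26 elements, i.e. constant time
bool is_anagram(const std::string& a, const std::string& b)
{
    return count_letters(a) == count_letters(b);
}
```

With the words from the question, is_anagram("Evil", "Vile") holds because both produce a 1 in the slots for e, i, l and v.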
Consider the following:
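map<string, set<string>> anagrams;
// The sorted letters of a word serve as the key shared by all of its anagrams
string sorted(string word)
{
    sort(word.begin(), word.end());
    return word;
}
for (const auto& word : words)
    anagrams[sorted(word)].insert(word);
const set<string>& find_anagrams(const string& word)
{
    return anagrams[sorted(word)];
}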
When you have a lot of words that are relatively short (or if you can work with big-number libraries), you can use a solution similar to what I wrote here:
Generate same unique hash code for all anagrams
Essentially, map each character to a unique prime number (it doesn't have to be big: the 26 letters of the alphabet fit into the primes up to 101), and for each word multiply the primes that correspond to its characters. Since multiplication is commutative, anagrams give the same result, so you can compare that result directly, hash it, or do whatever you want with it.
Keep in mind that for long words the values grow pretty fast, so you might need a big-number library.
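A small sketch of the prime mapping, assuming only the letters a-z are relevant and using a 64-bit unsigned integer in place of a big-number type. The function name prime_product is hypothetical. Once the product no longer fits into 64 bits it wraps around, and equal values then no longer prove that two words are anagrams, so this is only suitable for short words.

```cpp
#include <cctype>
#include <cstdint>
#include <string>

// Multiply the primes assigned to each letter; anagrams yield the same product.
// Assumption: the product fits into 64 bits (short words only).
std::uint64_t prime_product(const std::string& word)
{
    // The first 26 primes, one per letter a..z
    static const int primes[26] = {
         2,  3,  5,  7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
        43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
    };
    std::uint64_t product = 1;
    for (unsigned char c : word) {
        const int lower = std::tolower(c);
        if (lower >= 'a' && lower <= 'z')
            product *= static_cast<std::uint64_t>(primes[lower - 'a']);
    }
    return product;
}
```

Two words can then be compared with prime_product(a) == prime_product(b), or the product can be used as a key in a map.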
I am trying to find an optimal way to find a pattern of a string and compare. For example, I have s1 = "red blue blue red red yellow", and s2 = "abbaac". This would match because they have the same pattern.
My idea is to iterate through s1 and s2, use a vector container to record the count at each position (for s1 that is the count of the corresponding word, and for s2 the count of the corresponding letter), and then compare.
This is really inefficient because I iterate through the whole of s1 and s2. If s1 = "red blue red red red yellow" and s2 = "abbaac", then after the third red there is essentially no point in continuing to iterate.
So, any better idea on how to do this?
Code:
#include "stdafx.h"
#include <iostream>
#include <string>
#include <array>
#include <sstream>
#include <vector>
#include <algorithm>
using namespace std;
vector<int> findPattern(string pattern){
vector<int> counts;
for (int i = 0; i < pattern.size(); ++i){
counts.push_back(0);
int counter = 0;
for (int j = i + 1; j < pattern.size(); ++j){
if (pattern[i] == pattern[j]){
++counter;
}
counts[i] = counter;
}
}
return counts;
}
vector<int> findPatternLong(string pattern){
istringstream iss (pattern);
string word;
vector<string> v;
while (iss >> word){
v.push_back(word);
}
vector<int> counts2;
for (int i = 0; i < v.size(); ++i){
counts2.push_back(0);
int counter = 0;
for (int j = i + 1; j < v.size(); ++j){
if (v[i] == v[j]){
++counter;
}
counts2[i] = counter;
}
}
return counts2;
}
int main(int argc, char * argv[]){
vector<int> v1 = findPattern("abbaac");
vector<int> v2 = findPatternLong("red blue blue red red yellow");
if (v1.size() == v2.size()){
for (int i = 0; i < v1.size(); ++i){
if (v1[i] != v2[i]){
cout << "Unmatch" << endl;
return false;
}
}
cout << "match" << endl;
return true;
} else
cout << "Unmatch" << endl;
return 0;
}
@Tony beat me to it with the same idea, but since I already typed this, here it goes :-)
First of all, don't worry so much about efficiency and focus on correctness: indeed, premature optimization is the root of all evil. Write test cases and make sure your code passes each one.
Second, I think I would start with a map/dictionary D and a loop that parses one element of each string per iteration (a word from s1, let's call it "w", and a character from s2, say "c"). Choose one element as the key (say the "c" characters) and check whether "c" already has an entry in the dictionary:
If we ran out of elements at the same time, the strings match
If we ran out of elements on one side, we know there's no match
If "c" doesn't have an entry in D, store the current values: D[c] = w;
else if "c" already has an entry, check if the entry matches the value found on the string: is D[c] == w? If it doesn't we know there's no match
If that code works, then optimization could start. In your example, maybe we could use a simple array instead of a dictionary because ASCII characters are a small finite set.
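The steps above, with a plain array standing in for the dictionary, could be sketched like this. The function name pattern_matches and the extra used_words set are assumptions of this sketch: the set is added on the assumption that two different letters must not map to the same word, which the steps above do not spell out.

```cpp
#include <array>
#include <sstream>
#include <string>
#include <unordered_set>

// Returns true if the words in `words` follow the letter pattern in `letters`.
bool pattern_matches(const std::string& words, const std::string& letters)
{
    // One slot per possible char value; an empty string means "letter not seen yet"
    std::array<std::string, 256> word_for_letter;
    std::unordered_set<std::string> used_words;  // assumption: mapping is one-to-one

    std::istringstream iss(words);
    std::string word;
    for (unsigned char c : letters) {
        if (!(iss >> word))
            return false;  // ran out of words before running out of letters
        std::string& expected = word_for_letter[c];
        if (expected.empty()) {
            if (!used_words.insert(word).second)
                return false;  // word is already bound to a different letter
            expected = word;   // first time this letter is seen
        } else if (expected != word) {
            return false;      // same letter, different word: stop immediately
        }
    }
    return !(iss >> word);  // surplus words mean no match
}
```

The function returns at the first mismatch, so for "red blue red red red yellow" against "abbaac" it stops at the third word instead of scanning both inputs to the end.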
It's not the most efficient code, but close to simplest:
std::map<char, std::string> letter_to_word;
std::set<std::string> words_seen;
std::istringstream iss(s1);
std::string word;
for (std::string::size_type i = 0; i < s2.size(); ++i)
{
if (!(iss >> word))
return false; // more letters than words
std::string& expected_word = letter_to_word[s2[i]];
if (expected_word == "")
{
// if different letters require different words...
if (words_seen.find(word) != words_seen.end())
return false; // multiple letters for same word
words_seen.insert(word);
expected_word = word; // first time we've seen letter, remember associated word
}
else if (expected_word != word)
return false; // different word for same letter
}
return !(iss >> word); // check no surplus words
You don't need two vectors.
When processing the second string, compare the count of the first pattern, to the first entry. If it matches, keep going otherwise stop. Repeat for the rest of the patterns in the second string.
You don't need to store the pattern counts of the second string.
EDIT
I just read that the question had the patterns in a string and this answer pertains to comparing collections of varying types. I suppose the answer still holds a little water if the 2 input strings were first converted :)
I would not say this is the most efficient solution, but I like how it is extensible.
Firstly, there is the PatternResult class. It stores the result of a pattern:
class PatternResult {
private:
std::vector<int> result_;
public:
PatternResult(const std::vector<int>& result) : result_(result) {
};
bool operator == (const PatternResult& rhs) const {
if(result_.size() != rhs.result_.size())
return false;
else {
for(std::vector<int>::size_type r(0);
r < result_.size();
++r) {
if(result_[r] != rhs.result_[r])
return false;
};
return true;
};
};
}; // eo class PatternResult
It takes a vector of integers, each of which is the sequence number assigned to the element at that position. We overload == to compare two pattern results; equal results mean the inputs follow the same sequence irrespective of the source data.
Then we need a pattern counter that can assign the same sequence numbers, but take any type:
template<class T>
class PatternCounter {
private:
typedef std::vector<T> vec_type;
typedef std::map<T, int> map_type;
map_type found_;
int counter_;
public:
PatternCounter() : counter_(1) {
};
PatternResult count(const vec_type& input ){
std::vector<int> ret;
for(typename vec_type::const_iterator cit(input.begin());
cit != input.end();
++cit) {
if(found_.find(*cit) != found_.end()) {
ret.push_back(found_[*cit]);
} else {
found_[*cit] = counter_;
ret.push_back(counter_);
++counter_;
};
};
return PatternResult(ret);
};
};
And we're done. Test code:
std::vector<std::string> inp1;
inp1.push_back("red");
inp1.push_back("blue");
inp1.push_back("blue");
inp1.push_back("red");
inp1.push_back("yellow");
std::vector<char> inp2;
inp2.push_back('a');
inp2.push_back('b');
inp2.push_back('b');
inp2.push_back('a');
inp2.push_back('c');
PatternCounter<std::string> counter1;
PatternCounter<char> counter2;
PatternResult res1(counter1.count(inp1));
PatternResult res2(counter2.count(inp2));
if(res1 == res2) {
// pattern sequences are equal
};
Note this was quick and dirty, I am sure it could be made more efficient.
Basically, you want to check that the sequence follows the same order. You're not worried about what the sequence actually is: first second first first third is good enough. Now, you could do this with a container that maps a string to an int in some way. However, you would be storing copies of each string and you're ignoring the fact that you don't really care about string values. For tiny test cases, this wouldn't matter, but for a large sequence of long words, you're quickly chewing up memory when you don't need to.
So let's use the fact that we don't care about the string values or about storing them. If that's the case, we can use a hash function to transform our strings to simple size_t values that are very likely, though not guaranteed, to be unique (a hash collision would make two different words look identical). However, the hashes are not sequential and we will need to retrieve the sequence based on the hash value. The simplest way to record their sequence is to map each new hash to the current size of the map for easy lookup. The last piece of the puzzle is to check that the hashes are in the same sequence.
I'm also assuming that you don't just want to compare a sentence with a word, but maybe 2 words or two sentences. Here's a quick C++11 sample that basically does the above and doesn't hold anything in memory unless it needs to.
Of course, this can still be optimized further, for example by executing things in parallel.
#include <iostream>
#include <vector>
#include <string>
#include <map>
#include <sstream>
#include <functional> // std::hash
/*
s1 = "red blue blue red red yellow"
s2 = "abbaac"
This would match because they have the same pattern.
*/
typedef std::map<size_t,size_t> hash_map;
typedef std::vector<std::string> wordlist;
size_t ordered_symbol( hash_map &h, std::string const& word )
{
std::hash<std::string> hash_fn;
size_t hash = hash_fn(word);
if(h.find(hash)==h.end())
{
size_t const sequence = h.size();
h[hash] = sequence;
return sequence;
}
return h[hash];
}
wordlist create_wordlist( std::string const& str )
{
if(str.find_first_of(' ') != std::string::npos)
{
wordlist w1;
std::stringstream sstr(str);
std::string s;
while(sstr>>s)
w1.push_back(s);
return w1;
}
wordlist w2;
for(auto i : str)
{
std::string s;
s.append(1,i);
w2.push_back(s);
}
return w2;
}
bool pattern_matches( std::string const& s1, std::string const& s2 )
{
wordlist const w1 = create_wordlist(s1);
wordlist const w2 = create_wordlist(s2);
if(w1.size()!=w2.size())
return false;
hash_map h1,h2;
for( size_t i = 0; i!=w1.size(); ++i)
if(ordered_symbol(h1,w1[i])!=ordered_symbol(h2,w2[i]))
return false;
return true;
}
void test( std::string const& s1, std::string const& s2 )
{
std::cout<<"["<<s1<<"] "
<<(pattern_matches(s1,s2)? "<==>" : "<=!=>")
<<"["<<s2<<"]\n";
}
int main()
{
test("red blue blue red red yellow","abbaac");
test("red blue blue red red yellow","first second second first first third");
test("abbaac","12211g");
test("abbaac","red blue blue red red yellow");
test("abbgac","red blue blue red red yellow");
return 0;
}
//Output:
//[red blue blue red red yellow] <==>[abbaac]
//[red blue blue red red yellow] <==>[first second second first first third]
//[abbaac] <==>[12211g]
//[abbaac] <==>[red blue blue red red yellow]
//[abbgac] <=!=>[red blue blue red red yellow]
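One caveat to the hashing approach: two different words can, in principle, hash to the same value, in which case they would be treated as the same symbol. If exact results matter more than memory, a variant keyed on the words themselves avoids that. The following is only a sketch under that assumption; `ordered_symbol_exact` and `same_pattern` are hypothetical names, not part of the code above, and it takes already-split word lists as input.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: returns the first-appearance index of `word`,
// keyed on the word itself instead of on its hash value.
std::size_t ordered_symbol_exact(std::unordered_map<std::string, std::size_t>& seen,
                                 const std::string& word)
{
    // emplace inserts only if the key is absent; seen.size() is evaluated
    // before the insertion, so a new word gets the next sequence number.
    auto result = seen.emplace(word, seen.size());
    return result.first->second;
}

// Hypothetical helper: true if both word lists follow the same repetition pattern.
bool same_pattern(const std::vector<std::string>& a,
                  const std::vector<std::string>& b)
{
    if (a.size() != b.size())
        return false;
    std::unordered_map<std::string, std::size_t> seen_a, seen_b;
    for (std::size_t i = 0; i != a.size(); ++i)
        if (ordered_symbol_exact(seen_a, a[i]) != ordered_symbol_exact(seen_b, b[i]))
            return false;
    return true;
}
```

The trade-off is the one described earlier: this stores a copy of every distinct word, so it uses more memory on long inputs with many long words.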
EDIT: Here's a non-C++11 version that should work on VS2010. However, since C++03 does not include a string hash function in the standard library, this example uses a hash function taken from Stack Overflow. A much better hash function to use would be this one if you have access to the Boost libraries.
I am stuck at solving Accelerated C++ exercise 8-5 and I don't want to miss a single exercise in this book.
Accelerated C++ Exercise 8-5 is as follows:
Reimplement the gen_sentence and xref functions from Chapter 7 to use
output iterators rather than putting their entire output in one data
structure. Test these new versions by writing programs that attach the
output iterator directly to the standard output, and by storing the
results in list <string> and map<string, vector<int> >, respectively.
To give an idea of the scope of this question and what the book has covered so far: this exercise is part of the chapter about generic function templates and iterator usage in templates. The previous exercise was to implement simple versions of <algorithm> library functions, such as equal, find, copy, remove_copy_if, etc.
If I understand correctly, I need to modify the xref function so that it:
Uses an output iterator
Stores results in map<string, vector<int> >
I tried passing the map to this function via back_inserter(), .begin(), and .end(), but could not get it to compile. The answer here explains why.
xref function as in Chapter 7:
// find all the lines that refer to each word in the input
map<string, vector<int> >
xref(istream& in,
vector<string> find_words(const string&) = split)
{
string line;
int line_number = 0;
map<string, vector<int> > ret;
// read the next line
while (getline(in, line)) {
++line_number;
// break the input line into words
vector<string> words = find_words(line);
// remember that each word occurs on the current line
for (vector<string>::const_iterator it = words.begin();
it != words.end(); ++it)
ret[*it].push_back(line_number);
}
return ret;
}
Split implementation:
vector<string> split(const string& s)
{
vector<string> ret;
typedef string::size_type string_size;
string_size i = 0;
// invariant: we have processed characters [original value of i, i)
while (i != s.size()) {
// ignore leading blanks
// invariant: characters in range [original i, current i) are all spaces
while (i != s.size() && isspace(s[i]))
++i;
// find end of next word
string_size j = i;
// invariant: none of the characters in range [original j, current j) is a space
while (j != s.size() && !isspace(s[j]))
++j;
// if we found some nonwhitespace characters
if (i != j) {
// copy from s starting at i and taking j - i chars
ret.push_back(s.substr(i, j - i));
i = j;
}
}
return ret;
}
Please help me understand what I am missing.
I found more details on the exercise, here: https://stackoverflow.com/questions/5608092/accelerated-c-exercise-8-5-wording-help:
template <class Out>
void gen_sentence( const Grammar& g, string s, Out& out )
USAGE:
std::ostream_iterator<string> out_str (std::cout, " ");
gen_sentence( g, "<sentence>", out_str );
template <class Out, class In>
void xref( In& in, Out& out, vector<string> find_words( const string& ) = split )
USAGE:
std::ostream_iterator<string> out_str (std::cout, " ");
xref( cin, out_str, find_url ) ;
Frankly, I have to come to the conclusion that that question is ill-posed, specifically where they specified the new interface for xref: xref should result in a map. However, using output iterators would imply using std::inserter(map, map.end()) in this case. While you can write a compiling version of the code, this will not do what you expect since map::insert will simply ignore any insertions with duplicated keys.
If the goal of xref is only to link the words to the line number of their first appearance this would still be ok, but I have a feeling that the author of the exercise simply missed this subtler point :)
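To make the duplicate-key point concrete, here is a minimal sketch. `index_first_lines` is a hypothetical helper written only for this illustration: it copies (word, line) pairs into a std::map through std::inserter, and for a word that occurs more than once only the first line number survives.

```cpp
#include <cstddef>
#include <iterator>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: feeds (word, line) pairs into a map via std::inserter.
std::map<std::string, int>
index_first_lines(const std::vector<std::pair<std::string, int> >& occurrences)
{
    std::map<std::string, int> index;
    std::insert_iterator<std::map<std::string, int> > out =
        std::inserter(index, index.end());
    for (std::size_t i = 0; i != occurrences.size(); ++i)
        *out++ = occurrences[i]; // map::insert leaves an existing key untouched
    return index;
}
```

Feeding it ("foo", 1), ("bar", 2), ("foo", 3) leaves two entries, with "foo" still mapped to 1; the later occurrence on line 3 is silently dropped.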
Here is the code anyways (note that I invented a silly implementation for split, because it was both missing and required):
#include <map>
#include <vector>
#include <iostream>
#include <sstream>
#include <fstream>
#include <algorithm>
#include <iterator>
#include <string>
#include <utility> // std::make_pair
std::vector<std::string> split(const std::string& str)
{
std::istringstream iss(str);
std::vector<std::string> result;
std::copy(std::istream_iterator<std::string>(iss),
std::istream_iterator<std::string>(),
std::back_inserter(result));
return result;
}
// find all the lines that refer to each word in the input
template <typename OutIt>
OutIt xref(std::istream& in,
OutIt out,
std::vector<std::string> find_words(const std::string&) = split)
{
std::string line;
int line_number = 0;
// read the next line
while (getline(in, line)) {
++line_number;
// break the input line into words
std::vector<std::string> words = find_words(line);
// remember that each word occurs on the current line
for (std::vector<std::string>::const_iterator it = words.begin();
it != words.end(); ++it)
*out++ = std::make_pair(*it, line_number);
}
return out;
}
int main(int argc, const char *argv[])
{
std::map<std::string, int> index;
std::ifstream file("/tmp/test.cpp");
xref(file, std::inserter(index, index.end()));
#if __GXX_EXPERIMENTAL_CXX0X__
for(auto& entry: index)
std::cout << entry.first << " first found on line " << entry.second << std::endl;
#else
for(std::map<std::string, int>::const_iterator it = index.begin();
it != index.end();
++it)
{
std::cout << it->first << " first found on line " << it->second << std::endl;
}
#endif
return 0;
}
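If every line number is wanted, as in the original map<string, vector<int> >, one possible workaround is a small custom output iterator that appends instead of inserting. The following is only a sketch, not the book's intended solution: `xref_appender` and `emit_words` are hypothetical names, and `emit_words` is a simplified stand-in for a templated xref that writes one (word, line) pair per word.

```cpp
#include <cstddef>
#include <iterator>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical output iterator: assigning a (word, line) pair through it
// appends the line number to that word's vector in the target map.
class xref_appender {
public:
    typedef std::output_iterator_tag iterator_category;
    typedef void value_type;
    typedef std::ptrdiff_t difference_type;
    typedef void pointer;
    typedef void reference;

    explicit xref_appender(std::map<std::string, std::vector<int> >& m) : map_(&m) {}

    xref_appender& operator=(const std::pair<std::string, int>& entry) {
        (*map_)[entry.first].push_back(entry.second);
        return *this;
    }
    // Dereference and increment are no-ops, as in std::back_insert_iterator.
    xref_appender& operator*() { return *this; }
    xref_appender& operator++() { return *this; }
    xref_appender operator++(int) { return *this; }

private:
    std::map<std::string, std::vector<int> >* map_;
};

// Stand-in for a templated xref: emits one (word, line) pair per word,
// with lines numbered from 1.
template <typename OutIt>
OutIt emit_words(const std::vector<std::string>& lines, OutIt out)
{
    for (std::size_t n = 0; n != lines.size(); ++n) {
        std::istringstream iss(lines[n]);
        std::string word;
        while (iss >> word)
            *out++ = std::make_pair(word, static_cast<int>(n + 1));
    }
    return out;
}
```

With a map<string, vector<int> > named m, calling emit_words(lines, xref_appender(m)) records every line on which each word appears, including repeats of the same word.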