Finding the most common word(s) in C++ using an unordered_map approach

Find the most common word from a text input, excluding a list of given words. If there are multiple maximum words, display all of them.
My method works for 21/24 test cases, but I cannot think of the 3 test cases that I am missing.
I am adding the code that I have right now, which is efficient according to me. I don't want another way of implementing it right now (although suggestions are most welcome), I would just like to pick your brain about the possible test cases I am missing.
vector<string> mostCommonWord(string paragraph, vector<string>& banned) {
unordered_map<string, int>m;
for(int i = 0; i < paragraph.size();){
string s = "";
while(i < paragraph.size() && isalpha(paragraph[i])) s.push_back(tolower(paragraph[i++])); // go through till you find one word completely
while(i < paragraph.size() && !isalpha(paragraph[i])) i++; // avoid all the white spaces and other characters
m[s]++; // include the word found and increment its count
}
for(auto x: banned) m[x] = 0; // make the count of all the banned words to be 0
vector<string> result;
string res = "";
int count = INT_MIN;
// find the maximum count
for(auto x: m)
if(x.second > count) count = x.second;
// we might have the case where all the words were in banned words, which would result the count == -1, so return an empty vector in this case
if(count <= 0) return result;
// add the words corresponding to that to the final vector<string>
for(auto x: m)
if(x.second == count) result.push_back(x.first);
return result;
}
It works for all the scenarios I can think of, but fails 3 test cases.
I am not given access to those test cases, would just like to have a discussion of what it could possibly be!

Are you sure that other characters (such as digits) should be treated as word delimiters?
If the paragraph starts with whitespace or any other non-alphabetic character, you will insert the empty string into the map: m[""] = 1. If every real word also occurs once, the empty string ends up in the result.
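A minimal sketch of how the empty-token problem could be avoided, assuming the rest of the asker's logic is what the tests expect. The name mostCommonWords is hypothetical, banned words are assumed to be lowercase already, and digits are still treated as delimiters, which is only a guess about the problem statement.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical variant of the asker's function. Digits and punctuation are
// treated as delimiters (an assumption about the problem statement).
std::vector<std::string> mostCommonWords(const std::string& paragraph,
                                         const std::vector<std::string>& banned) {
    std::unordered_set<std::string> skip(banned.begin(), banned.end());
    std::unordered_map<std::string, int> counts;
    std::string word;
    int best = 0;
    // Go one position past the end so the last word is flushed as well.
    for (std::size_t i = 0; i <= paragraph.size(); ++i) {
        const bool letter = i < paragraph.size() &&
            std::isalpha(static_cast<unsigned char>(paragraph[i]));
        if (letter) {
            word.push_back(static_cast<char>(
                std::tolower(static_cast<unsigned char>(paragraph[i]))));
        } else if (!word.empty()) {  // never count the empty token
            if (skip.count(word) == 0) {
                best = std::max(best, ++counts[word]);
            }
            word.clear();
        }
    }
    std::vector<std::string> result;
    for (const auto& entry : counts) {
        if (entry.second == best) result.push_back(entry.first);
    }
    return result;
}
```

Banned words are never inserted into the map here, so no separate "count <= 0" check is needed: an empty map simply yields an empty result.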

Related

How to get all dictionary words from a list of letters?

I have an input string, like "fairy", and I need to get the English words that can be formed from it. Here's an example:
5: Fairy
4: Fray, Airy, Fair, Fiar
3: Fay, fry, arf, ary, far, etc.
I have an std::unordered_set<std::string> of dictionary words so I can easily iterate over it. I've created permutations before as shown below:
std::unordered_set<std::string> permutations;
// Finds every permutation (non-duplicate arrangement of letters)
std::sort(letters.begin(), letters.end());
do {
// Check if the word is a valid dictionary word first
permutations.insert(letters);
} while (std::next_permutation(letters.begin(), letters.end()));
That's perfect for length 5. I can check each letters to see if it matches and I end up with "fairy", which is the only 5 letter word that can be found from those letters.
How could I find words of a smaller length? I'm guessing it has to do with permutations as well, but I wasn't sure how to implement it.
You can keep an auxiliary data structure and add a special symbol to mark the end of the word:
#include <algorithm>
#include <string>
#include <set>
#include <list>
#include <iostream>
int main()
{
std::list<int> l = {-1, 0 ,1, 2, 3, 4};
std::string s = "fairy";
std::set<std::string> words;
do {
std::string temp = "";
for (auto e : l)
if (e != -1) temp += s[e];
else break;
words.insert(temp);
} while(std::next_permutation(l.begin(), l.end()));
}
Here the special symbol is -1
Okay, you have to ask yourself a question. Can you reuse letters? For instance, if you're given the word friend, is fee legal? Friend has 1 e and fee has 2. That's an important but minor detail.
Algorithm 1: Brute Force
You can iterate over your entire list of possible words and write a method "does this word consist only of letters in this other word"? If so, add it to your final list.
That algorithm changes very slightly based on your answer to the first question, but it's not hard to write.
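As a rough sketch of the brute-force idea, assuming letters may not be reused and that all input is lowercase a-z. The names canSpell and wordsFrom are made up for this example.

```cpp
#include <array>
#include <string>
#include <unordered_set>
#include <vector>

// Returns true if `word` can be spelled from `letters`, using each letter at
// most as often as it occurs in `letters` (the "no reuse" case).
// Assumes lowercase a-z; any other character makes the word invalid.
bool canSpell(const std::string& word, const std::string& letters) {
    std::array<int, 26> available{};
    for (char c : letters) {
        if (c >= 'a' && c <= 'z') ++available[c - 'a'];
    }
    for (char c : word) {
        if (c < 'a' || c > 'z' || --available[c - 'a'] < 0) return false;
    }
    return !word.empty();
}

// Scans the whole dictionary once and keeps the words that can be spelled.
std::vector<std::string> wordsFrom(const std::unordered_set<std::string>& dictionary,
                                   const std::string& letters) {
    std::vector<std::string> found;
    for (const auto& word : dictionary) {
        if (canSpell(word, letters)) found.push_back(word);
    }
    return found;
}
```

If letters may be reused, the count check would become a simple "is this letter present at all" test.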
Algorithm 2: Recursive Approach
Create a method addWords().
/**
* letters is the list of letters you're allowed to use
* word may not be empty
*/
void addWords(string letters, string word) {
size_t length = letters.length(); // iterate over the letters still available, not over word
for (size_t index = 0; index < length; ++index) {
string newWord = word + letters[index];
string remainingLetters = letters.substr(0, index) + letters.substr(index + 1);
// if newword is in your dictionary, add it to the output
...
addWords(remainingLetters, newWord);
}
}
Let's look how this works with addWords("fairy", "") --
First loop: add f to the empty string and check if f is a word.
Then recurse with addWords("airy", "f"). We'll look at recursion shortly.
Second loop: add a to the empty string and check if a is a word. It is, so we'll add it to the output and recurse with addWords("firy", "a").
Repeat, checking each one-letter word (5 times total).
Now, let's look at one level of recursion -- addWords("airy", "f"). Now, we're going to try in order fa, fi, etc. Then we'll recurse again with something like addWords("iry", "fa") (etc).
From recursing the second loop, we would try words beginning with a instead of f. So we would end up testing af, ai, etc.
This works if the answer to your first question is "no, we don't reuse letters". This method does NOT work if the answer is yes.
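A self-contained sketch of the recursive approach, for the "no reuse" case only. The dictionary and found parameters are additions made for this example, as is the check that skips a letter already tried at the same depth.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <unordered_set>

// `dictionary` and `found` are extra parameters introduced for this sketch.
void addWords(const std::string& letters, const std::string& word,
              const std::unordered_set<std::string>& dictionary,
              std::set<std::string>& found) {
    for (std::size_t index = 0; index < letters.size(); ++index) {
        // Skip a letter that was already tried at this depth (duplicate letters).
        if (letters.find(letters[index]) != index) continue;
        const std::string newWord = word + letters[index];
        const std::string remaining =
            letters.substr(0, index) + letters.substr(index + 1);
        if (dictionary.count(newWord) != 0) found.insert(newWord);
        addWords(remaining, newWord, dictionary, found);
    }
}
```

Calling it with an empty word, as in addWords("fairy", "", dictionary, found), tries every arrangement of every length, so the cost grows factorially with the number of letters.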
You can check every prefix (or suffix, you need only be consistent) of each permutation. This will consider some substrings multiple times, but it's a simple change.
std::unordered_set<std::string> permutations;
// Finds every permutation (non-duplicate arrangement of letters)
std::sort(letters.begin(), letters.end());
do {
std::string_view view = letters;
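for (std::size_t i = 1; i <= view.size(); ++i) { // <= so the full-length permutation is checked too
auto prefix = view.substr(0, i);
// check if prefix is dictionary word
permutations.insert(std::string(prefix)); // string_view does not convert to std::string implicitly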
}
} while (std::next_permutation(letters.begin(), letters.end()));
Algorithmically, you can use something like a counter to generate all subsets of your word, and then find all the permutations:
For example:
00000 ---> null
00001 ---> y
00010 ---> r
00011 ---> ry
00100 ---> i
00101 ---> iy
...
11110 ---> Fair
11111 ---> Fairy
Note: Now run your permutation function on each subset to generate the other orderings of its characters (for example with std::next_permutation).
To implement the counter, use something like a boolean array: flip the lowest bit and carry into the higher bits when needed. At each step, pick the characters whose indices are true in the boolean array.
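The counter idea can be sketched with an unsigned integer used as the bit mask instead of a boolean array. The function name allArrangements is hypothetical, and the sketch assumes the word has fewer letters than there are bits in unsigned.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>

// Generates every non-empty subset of `letters` via a bit mask, then every
// ordering of each subset. Assumes letters.size() is smaller than the number
// of bits in `unsigned` (fine for short words such as "fairy").
std::set<std::string> allArrangements(const std::string& letters) {
    std::set<std::string> result;
    const std::size_t n = letters.size();
    for (unsigned mask = 1; mask < (1u << n); ++mask) {
        std::string subset;
        for (std::size_t bit = 0; bit < n; ++bit) {
            if (mask & (1u << bit)) subset += letters[bit];
        }
        // next_permutation needs a sorted start to visit all orderings.
        std::sort(subset.begin(), subset.end());
        do {
            result.insert(subset);
        } while (std::next_permutation(subset.begin(), subset.end()));
    }
    return result;
}
```

Each candidate in the returned set can then be looked up in the dictionary.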
Trie might be an appropriate structure to store the word.
I suggest also to use sorted anagram as "key" instead of the word directly:
class Node
{
public:
std::map<char, Node> children; // Might be an array<std::unique_ptr<Node>, 26>
std::set<std::string> data; // List of word valid with anagram
// Traditionally, Trie would use instead ` bool endOfWord = false;`
Node() = default;
const Node* get(char c) const
{
auto it = children.find(c);
if (it == children.end()) {
return nullptr;
}
return &it->second;
}
};
class Trie
{
Node root;
public:
void add(const std::string& word)
{
std::string sorted = word;
std::sort(sorted.begin(), sorted.end());
Node* node = &root;
for (const char c : sorted) {
node = &node->children[c];
}
node->data.insert(word);
}
// ...
};
Then to print all anagrams you might do:
void print_valid_words(std::string letters) const
{
std::sort(letters.begin(), letters.end());
print_valid_words(&root, letters);
}
private:
void print_valid_words(const Node* current, std::string_view letters) const
{
if (current == nullptr) return;
for (auto word : current->data) {
std::cout << word << std::endl;
}
for (std::size_t i = 0; i < letters.size(); ++i)
{
if (i == 0 || letters[i] != letters[i - 1]) {
print_valid_words(current->get(letters[i]), letters.substr(i + 1));
}
}
}

Given an integer K and a matrix of size t x t. construct a string s consisting of first t lowercase english letters such that the total cost of s is K

I'm solving this problem and stuck halfway through, looking for help and a better method to tackle such a problem:
problem:
Given an integer K and a matrix of size t x t, we have to construct a string s consisting of the first t lowercase English letters such that the total cost of s is exactly K. It is guaranteed that at least one string satisfies the given conditions. Among all such strings, find the lexicographically smallest one.
Specifically, the cost of having the ith character of the English alphabet followed by the jth character is equal to cost[i][j].
For example, the cost of having 'a' followed by 'a' is denoted by cost[0][0] and the cost of having 'b' followed by 'c' is denoted by cost[1][2].
The total cost of a string is the sum of the costs of every pair of consecutive characters in s. For example, if the cost matrix is
[1 2]
[3 4],
and the string is "abba", then we have:
the cost of having 'a' followed by 'b' is cost[0][1]=2.
the cost of having 'b' followed by 'b' is cost[1][1]=4.
the cost of having 'b' followed by 'a' is cost[1][0]=3.
In total, the cost of the string "abba" is 2+4+3=9.
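The definition of total cost can be written as a short helper. This is only an illustration: the name totalCost is made up, and it assumes s contains nothing but the first t lowercase letters, where t is the size of the matrix.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Sums cost[x][y] over every pair of consecutive characters x, y in s.
// Assumes s uses only the first cost.size() lowercase letters.
int totalCost(const std::string& s, const std::vector<std::vector<int>>& cost) {
    int total = 0;
    for (std::size_t p = 0; p + 1 < s.size(); ++p) {
        total += cost[s[p] - 'a'][s[p + 1] - 'a'];
    }
    return total;
}
```

With the matrix above, totalCost("abba", cost) adds cost[0][1], cost[1][1] and cost[1][0].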
Example:
Consider, for example, K = 3 and t = 2, with the cost matrix
[2 1]
[3 4]
There are two strings whose total cost is 3:
"aab"
"ba"
Our answer will be "aab", as it is the lexicographically smallest.
my approach
I tried to find and store all combinations of i, j whose costs sum to the desired value K, or individually equal K.
for above example
v={
{2,1},
{3,4}
}
k = 3
Here v[0][0] + v[0][1] = 3 and v[1][0] = 3. I tried to store the pairs in a structure like std::vector<std::vector<std::pair<int, int>>>. Based on it, I would create all possible strings and store them in a set, which would give me the strings in lexicographical order.
I got stuck after writing this much code:
#include<iostream>
#include<vector>
int main(){
using namespace std;
vector<vector<int>>v={{2,1},{3,4}};
vector<pair<int,int>>k;
int size=v.size();
for(size_t i=0;i<size;i++){
for(size_t j=0;j<size;j++){
if(v[i][j]==3){
k.push_back(make_pair(i,j));
}
}
}
}
Please help me understand how such a problem can be tackled. My code can only find the individual [i,j] pairs whose cost equals the desired K. I have no idea how to collect multiple [i,j] pairs that sum to the desired value, and my approach also appears to be naive brute force. I am looking for a better way to think about such problems and implement them in code. Thank you.
This is a backtracking problem. General approach is :
a) Start with the "smallest" letter for e.g. 'a' and then recurse on all the available letters. If you find a string that sums to K then you have the answer because that will be the lexicographically smallest as we are finding it from smallest to largest letter.
b) If not found in 'a' move to the next letter.
Recurse/backtrack can be done as:
Start with a letter i and the original value of K.
For every j = 0 to t-1, append letter j and reduce K by cost[i][j].
If K == 0, you have found your string.
If K < 0, that path is not possible, so remove the last letter from the string and try other paths.
Pseudocode :
// cost, t and K are assumed to be visible here (e.g. globals)
bool recurse(int i, int t, int K, string& s); // s is passed by reference so the result survives the call
string find_smallest() {
for (int i = 0; i < t; i++) {
string s(1, (char)(i+97));
if ( recurse(i,t,K,s) ) return s;
}
return "";
}
bool recurse(int i, int t, int K, string& s) {
if ( K < 0 ) {
return false;
}
if ( K == 0 ) {
return true;
}
for ( int j = 0; j < t; j++ ) {
s += (char)(j+97);
bool v = recurse(j, t, K-cost[i][j], s);
if ( v ) return true;
s.pop_back(); // undo the last letter before trying the next one
}
return false;
}
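For reference, a compilable sketch of the same backtracking idea with the matrix and K passed as parameters. The names extend and smallestString are hypothetical, and the sketch assumes every cost is strictly positive, since otherwise the recursion may never terminate.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Tries to extend `s` (whose last letter has index `last`) so that exactly
// `remaining` more cost is spent. Assumes all costs are strictly positive.
bool extend(int last, int remaining, const std::vector<std::vector<int>>& cost,
            std::string& s) {
    if (remaining < 0) return false;
    if (remaining == 0) return true;
    for (std::size_t j = 0; j < cost.size(); ++j) {
        s.push_back(static_cast<char>('a' + j));
        if (extend(static_cast<int>(j), remaining - cost[last][j], cost, s)) return true;
        s.pop_back();  // backtrack before trying the next letter
    }
    return false;
}

// Tries start letters in increasing order, so the first hit is the
// lexicographically smallest string. Returns "" if nothing is found.
std::string smallestString(const std::vector<std::vector<int>>& cost, int K) {
    for (std::size_t i = 0; i < cost.size(); ++i) {
        std::string s(1, static_cast<char>('a' + i));
        if (extend(static_cast<int>(i), K, cost, s)) return s;
    }
    return "";
}
```

For the example matrix {{2, 1}, {3, 4}} and K = 3, this returns "aab". The search is exponential in the worst case, so it is only a starting point for larger inputs.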
In your implementation, you would probably need another vector of vectors of pairs to explore all your candidates. Also another vector for updating the current cost of each candidate as it builds up. Following this approach, things start to get a bit messy (IMO).
A cleaner and more understandable option (IMO again) could be to approach the problem with recursion:
#include <iostream>
#include <string>
#include <vector>
#define K 3
using namespace std;
string exploreCandidate(int currentCost, string currentString, vector<vector<int>> &v)
{
if (currentCost == K)
return currentString;
int size = v.size();
int lastChar = (int)currentString.back() - 97; // get ASCII code
for (size_t j = 0; j < size; j++)
{
int nextTotalCost = currentCost + v[lastChar][j];
if (nextTotalCost > K)
continue;
string nextString = currentString + (char)(97 + j); // get ASCII char
string exploredString = exploreCandidate(nextTotalCost, nextString, v);
if (exploredString != "00") // It is a valid path
return exploredString;
}
return "00";
}
int main()
{
vector<vector<int>> v = {{2, 1}, {3, 4}};
int size = v.size();
string initialString = "00"; // reserve first two positions
for (size_t i = 0; i < size; i++)
{
for (size_t j = 0; j < size; j++)
{
initialString[0] = (char)(97 + i);
initialString[1] = (char)(97 + j);
string exploredString = exploreCandidate(v[i][j], initialString, v);
if (exploredString != "00") { // It is a valid path
cout << exploredString << endl;
return 0;
}
}
}
}
Let us begin from the main function:
We define our matrix and iterate over it. For each position, we define the corresponding sequence. Notice that we can use indices to get the respective character of the English alphabet, knowing that in ASCII code a=97, b=98...
Having this initial sequence, we can explore candidates recursively, which lead us to the exploreCandidate recursive function.
First, we check whether the current cost is already the value we are looking for. If it is, we return immediately without evaluating any further candidates. We can do this because we are looking for the lexicographically smallest string, and we are not asked to provide information about all the candidates.
If the cost condition is not satisfied (cost < K), we need to continue exploring our candidate, but not for the whole matrix but only for the row corresponding to the last character. Then we can encounter two scenarios:
The cost condition is met (cost = K): if at some depth of the recursion the cost is equal to our value K, then the string is a valid one, and since it will be the first one we encounter, we want to return it and finish the execution.
The cost is not valid (cost > K): If the current cost is greater than K, then we need to abort this branch and see if other branches are luckier. Returning a boolean would be nice, but since we want to output a string (or maybe not, depending on the statement), an option could be to return a string and use "00" as our "false" value, allowing us to know whether the cost condition has been met. Other options could be returning a boolean and using an output parameter (passed by reference) to contain the output string.
EDIT:
The provided code assumes positive non-zero costs. If some costs were zero you could run into infinite recursion, so you would need to add more constraints to your recursive function.

Check String For Consecutive Pairs C++

I'm writing a C++ console app that reads lines of text from a .txt file, which I have done. What I need to do now is check each line for consecutive pairs of letters.
"For example, the word “tooth” has one pair of double letters, and the word “committee” has two pairs of consecutive double letters."
Should I convert each line into a Cstring and loop through each character? I really don't know where to start with this.
I'm not looking for someone to write out the entire solution, I just need to know how to start this.
You could loop through the string from start to the second last char and compare 2 chars at a time. In C++17 you have std::string_view which is handy.
#include <string_view>
size_t pair_count(std::string_view s) {
size_t rv = 0; // the result
for(size_t idx = 0; idx + 1 < s.size(); ++idx) { // idx + 1 avoids unsigned underflow when s is empty
// compare s[idx] and s[idx+1]
// if they are equal, increase rv by one
// and increase idx by one (if you want "aaa" to count as 1 and not 2)
}
return rv;
}
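One way the loop body could be filled in, as a sketch rather than the answerer's own code. The name countDoubleLetters is made up, and the function takes a std::string instead of a std::string_view so that it also builds before C++17.

```cpp
#include <cstddef>
#include <string>

// Counts adjacent equal characters, skipping past each pair once it is
// counted, so "aaa" yields 1 and "aaaa" yields 2.
std::size_t countDoubleLetters(const std::string& s) {
    std::size_t pairs = 0;
    for (std::size_t idx = 0; idx + 1 < s.size(); ++idx) {
        if (s[idx] == s[idx + 1]) {
            ++pairs;
            ++idx;  // skip the second letter of the pair just counted
        }
    }
    return pairs;
}
```

Dropping the inner ++idx makes overlapping pairs count separately, so "aaa" would give 2. The comparison is case-sensitive and also matches repeated spaces or punctuation, so a real solution might filter with std::isalpha first.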

Is String a Permutation of list of strings

A list of words and a bigger string are given. How can we find whether the string is a permutation of the smaller strings?
eg- s= badactorgoodacting dict[]={'actor','bad','act','good'} FALSE
eg- s= badactorgoodacting dict[]={'actor','bad','acting','good'} TRUE
The smaller words themselves don't need to be permuted. The question is whether we can find an ordering of the smaller strings such that, if concatenated in that order, it gives the larger string.
One more constraint: some words from dict[] may be left over unused.
The following gives a complexity of O(n^2). Are there other ways to do this, so as to improve complexity or increase efficiency in general? Mostly in Java. Thanks in advance!
bool strPermuteFrom(unordered_set<string> &dict, string &str, int pos)
{
if(pos >= str.size())
return true;
if(dict.size() == 0)
return false;
for(int j = pos; j < str.size(); j++){
string w = str.substr(pos, j - pos + 1);
if(dict.find(w) != dict.end()){
dict.erase(w);
int status = strPermuteFrom(dict, str, j+1);
dict.insert(w);
if(status){
if(pos > 0) str.insert(pos, " ");
return status;
}
}
}
return false;
}
bool strPermute(unordered_set<string> &dict, string &str)
{
return strPermuteFrom(dict, str, 0);
}
The code sample you give doesn't take much advantage of unordered_set (equivalent to Java HashSet's properties); each lookup is O(1), but it has to perform many lookups (for each possible prefix, for the entire length of the string). A std::set (Java TreeSet), being ordered, would allow you to find all possible prefixes at a given point in a single O(log n) lookup (followed by a scan from that point until you were no longer dealing with possible prefixes), rather than stringlength O(1) lookups at each recursive step.
So where this code is doing O(stringlength * dictsize^2) work, using a sorted set structure should reduce it to O(dictsize log dictsize) work. The string length doesn't matter as much, because you no longer lookup prefixes; you look up the remaining string once at each recursive step, and because it's ordered, a matching prefix will sort just before the whole string, no need to check individual substrings. Technically, backtracking would still be necessary (to handle a case where a word in the dict was a prefix of another word in the dict, e.g. act and acting), but aside from that case, you'd never need to backtrack; you'd only ever have a single hit for each recursive step, so you'd just be performing dictsize lookups, each costing log dictsize.

Comparisons of strings with c++

I used to have some code in C++ which stores strings as a series of characters in a character matrix (a string is a row). The classes CharacterMatrix and LogicalVector are provided by Rcpp.h:
LogicalVector unq_mat( CharacterMatrix x ){
int nc = x.ncol() ; // Get the number of columns in the matrix.
LogicalVector out(nc); // Make a logical (bool) vector of the same length.
// For every col in the matrix, assess whether the column contains more than one unique character.
for( int i=0; i < nc; i++ ) {
out[i] = unique( x(_,i) ).size() != 1 ;
}
return out;
}
The logical vector identifies which columns contain more than one unique character. This is then passed back to the R language and used to manipulate a matrix. This is a very R way of thinking of doing this. However I'm interested in developing my thinking in C++, I'd like to write something that achieves the above: So finds out which characters in n strings are not all the same, but preferably using the stl classes like std::string. As a conceptual example given three strings:
A = "Hello", B = "Heleo", C = "Hidey". The code would point out that positions/characters 2,3,4,5 are not one value, but position/character 1 (the 'H') is the same in all strings (i.e. there is only one unique value). I have something below that I thought worked:
std::vector<int> StringsCompare(std::vector<string>& stringVector) {
std::vector<int> informative;
for (int i = 0; i < stringVector[0].size()-1; i++) {
for (int n = 1; n < stringVector.size()-1; n++) {
if (stringVector[n][i] != stringVector[n-1][i]) {
informative.push_back(i);
break;
}
}
}
return informative;
}
It's supposed to go through every character position (0 to size of string-1) with the outer loop, and with the inner loop, see if the character in string n is not the same as the character in string n-1. In cases where the character is the same in all strings, for example the H in my hello example above, this will never be true. For cases where the characters in the strings differ, the inner loop's if statement will be satisfied, the character position recorded, and the inner loop broken out of. I then get a vector containing the indices of the characters in the n strings where the characters are not all identical. However, these two functions give me different answers. How else can I go through n strings char by char and check they are not all identical?
Thanks,
Ben.
I expected @doctorlove to provide an answer. I'll enter one here in case he does not.
To iterate through all of the elements of a string or vector by index, you want i from 0 to size()-1. for (int i=0; i<str.size(); i++) stops just short of size, i.e., stops at size()-1. So remove the -1's.
Second, C++ arrays are 0-based, so to report 1-based positions as in your example you must adjust (by adding 1 to the value that is pushed into the vector).
std::vector<int> StringsCompare(std::vector<std::string>& stringVector) {
std::vector<int> informative;
for (int i = 0; i < stringVector[0].size(); i++) {
for (int n = 1; n < stringVector.size(); n++) {
if (stringVector[n][i] != stringVector[n-1][i]) {
informative.push_back(i+1);
break;
}
}
}
return informative;
}
A few things to note about this code:
The function should take a const reference to vector, as the input vector is not modified. Not really a problem here, but for various reasons, it's a good idea to declare unmodified input references as const.
This assumes that all the strings are at least as long as the first. If that doesn't hold, the behavior of the code is undefined. For "production" code, you should include a check for the length prior to extracting the ith element of each string.
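Both notes can be folded into a sketch like the following. The name informativePositions is hypothetical, and comparing only the positions present in every string is an assumption, since the question does not say how strings of unequal length should be handled.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Returns the 1-based positions at which the strings do not all agree.
// Only positions that exist in every string are compared (an assumption).
std::vector<int> informativePositions(const std::vector<std::string>& strings) {
    std::vector<int> informative;
    if (strings.empty()) return informative;
    std::size_t shortest = strings[0].size();
    for (const auto& s : strings) shortest = std::min(shortest, s.size());
    for (std::size_t i = 0; i < shortest; ++i) {
        for (std::size_t n = 1; n < strings.size(); ++n) {
            // Comparing against the first string is enough to detect any mismatch.
            if (strings[n][i] != strings[0][i]) {
                informative.push_back(static_cast<int>(i) + 1);
                break;
            }
        }
    }
    return informative;
}
```

For the strings "Hello", "Heleo" and "Hidey" this reports positions 2, 3, 4 and 5, matching the example in the question.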