Finding anagrams in a word list

I have a word list and a file containing a number of anagrams. These anagrams are words found in the word list. I need to develop an algorithm to find the matching words and write them to an output file. The code I have developed so far only works for the first two words. In addition, I can't get the code to work with strings that contain digits. Please tell me how I can fix the code.
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main (void)
{
int x = 0, y = 0;
int a = 0, b = 0;
int emptyx, emptyy;
int match = 0;
ifstream f1, f2;
ofstream f3;
string line, line1[1500], line2[50];
size_t found;
f1.open ("wordlist.txt");
f2.open ("file.txt");
f3.open ("output.txt");
while (f1.eof() == 0)
{
getline (f1, line);
line1[x] = line;
x++;
}
while (f2.eof() == 0)
{
getline (f2, line);
line2[y] = line;
y++;
}
//finds position of last elements
emptyx = x-1;
emptyy = y-1;
//matching algorithm
for (y = 0; y <= emptyy; y++)
{
for (x = 0; x <= emptyx; x++)
{
if (line2[y].length() == line1[x].length())
{
for (a = 0; a < line1[x].length(); a++)
{
found = line2[y].find(line1[x][a]);
if (found != string::npos)
{
match++;
line2[y].replace(found, 1, 1, '.');
if (match == line1[x].length())
{
f3 << line1[x] << ", ";
match = 0;
}
}
}
}
}
}
f1.close();
f2.close();
f3.close();
return 0;
}

Step 1: Build an index whose key is the sorted characters of each word in the word list and whose value is the word itself.
act - cat
act - act
dgo - dog
...
aeeilnppp - pineapple
....
etc...
Step 2: For each anagram you want to find, sort the characters in your anagram word, and then match against the index to retrieve all words from index with matching sorted key.
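The two steps above can be sketched in C++ roughly as follows. This is a minimal sketch, not the answerer's own code: the names sortedKey, buildIndex and findMatches are made up for illustration, and an unordered_map from sorted key to a vector of words is assumed as the index. Because the key is just the sorted characters, words that contain digits need no special handling.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical alias for the index: sorted key -> all words with that key.
using AnagramIndex = std::unordered_map<std::string, std::vector<std::string>>;

// The key is the word's characters in sorted order; anagrams share a key.
std::string sortedKey(std::string word) {
    std::sort(word.begin(), word.end());
    return word;
}

// Step 1: file every word of the word list under its sorted key.
AnagramIndex buildIndex(const std::vector<std::string>& wordList) {
    AnagramIndex index;
    for (const std::string& word : wordList)
        index[sortedKey(word)].push_back(word);
    return index;
}

// Step 2: sort the anagram's characters and look the key up in the index.
std::vector<std::string> findMatches(const AnagramIndex& index,
                                     const std::string& anagram) {
    auto it = index.find(sortedKey(anagram));
    if (it == index.end())
        return {};            // no word in the list uses these characters
    return it->second;
}
```

Building the index costs one sort per word in the list, after which each anagram needs one sort and one hash lookup instead of a scan over the whole list.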

Trying to improve Mitch Wheat's solution:
Storing both sorted order and the word is really not necessary - store only the sorted string for every word in list.
Anyways, when we read a word from the file we have to sort it to find if it is equal to the sorted string - and the index is indexed on sorted string, so it will not help anyways.
Build a 'position independent' hash with words in word list - also store the sorted string in the hash.
For every word in file, get the 'position independent' hash and check in hashtable.
If hit, sort and compare to every sorted string stored at this position in hash (collisions!).
Thoughts?
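One possible reading of a 'position independent' hash is a hash built from a commutative operation, so that the order of the characters cannot influence the result. The sketch below is only an illustration under that assumption; the function name and the per-character mixing formula are arbitrary choices, not something the answer specifies.

```cpp
#include <cstdint>
#include <string>

// Hypothetical order-independent hash: the per-character values are summed,
// and addition is commutative, so every rearrangement of the same
// characters produces the same result.
std::uint64_t positionIndependentHash(const std::string& word) {
    std::uint64_t h = 0;
    for (unsigned char c : word) {
        std::uint64_t x = c;
        // Arbitrary mix so that different character multisets rarely
        // collide; unsigned overflow simply wraps around.
        h += x * x * x + 1099511628211ULL * x;
    }
    return h;
}
```

Two different sets of characters can still land on the same value, which is why the sorted strings stored in the bucket have to be compared after a hit.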

Related

Max Length Substring without any repeating character

Given a String, find the length of longest substring without any repeating character.
Example 1:
Input: s = "abcabcbb"
Output: 3
Explanation: The answer is abc with length of 3.
Example 2:
Input: s = "bbbbb"
Output: 1
Explanation: The answer is b with length of 1.
My solution works, but it isn't optimised. How can this be done in O(n) time?
#include<bits/stdc++.h>
using namespace std;
int solve(string str) {
if(str.size()==0)
return 0;
int maxans = INT_MIN;
for (int i = 0; i < str.length(); i++) // outer loop for traversing the string
{
unordered_set < int > set;
for (int j = i; j < str.length(); j++) // nested loop for getting different string starting with str[i]
{
if (set.find(str[j]) != set.end()) // if element if found so mark it as ans and break from the loop
{
maxans = max(maxans, j - i);
break;
}
set.insert(str[j]);
}
}
return maxans;
}
int main() {
string str = "abcsabcds";
cout << "The length of the longest substring without repeating characters is " <<
solve(str);
return 0;
}

Use a two pointer approach along with a hashmap here.
Initialise two pointers i = 0, j = 0 (i and j denote the left and right boundary of the current substring)
If the j-th character is not in the map, we can extend the substring. Add the j-th char to the map and increment j.
If the j-th character is in the map, we can not extend the substring without removing the earlier occurrence of the character. Remove the i-th char from the map and increment i.
Repeat this while j < length of string
This will have a time and space complexity of O(n).
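The steps above could look roughly like this in C++. It is a sketch under one assumption: since only membership is needed, it uses a std::unordered_set in place of the hashmap the answer mentions. The function name longestUniqueSubstring is made up for illustration.

```cpp
#include <algorithm>
#include <string>
#include <unordered_set>

int longestUniqueSubstring(const std::string& s) {
    std::unordered_set<char> window;  // characters currently in s[i, j)
    std::size_t i = 0, j = 0, best = 0;
    while (j < s.size()) {
        if (window.count(s[j]) == 0) {
            // s[j] is new: extend the substring to the right.
            window.insert(s[j]);
            ++j;
            best = std::max(best, j - i);
        } else {
            // s[j] is already present: drop s[i] and move the left boundary.
            window.erase(s[i]);
            ++i;
        }
    }
    return static_cast<int>(best);
}
```

Each character is inserted once and erased at most once, so the loop body runs at most 2n times.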

#include <string>
#include <iostream>
#include <vector>
int main() {
// 1
std::string s;
std::cin >> s;
// 2
std::vector<int> lut(127, -1);
int i, beg{ 0 }, len_curr{ 0 }, len_ans{ 0 };
for (i = 0; i != s.size(); ++i) {
if (lut[s[i]] == -1 || lut[s[i]] < beg) {
++len_curr;
}
else {
if (len_curr > len_ans) {
len_ans = len_curr;
}
beg = lut[s[i]] + 1;
len_curr = i - lut[s[i]];
}
lut[s[i]] = i;
}
if (len_curr > len_ans) {
len_ans = len_curr;
}
// 3
std::cout << len_ans << '\n';
return 0;
}
In // 1 you:
Define and read your string s.
In // 2 you:
Define your look up table lut, which is a vector of int and consists of 127 buckets each initialized with -1. As per this article there are 95 printable ASCII characters numbered 32 to 126 hence we allocated 127 buckets. lut[ch] is the position in s where you found the character ch for the last time.
Define i (index variable for s), beg (the position in s where your current substring begin at), len_curr (the length of your current substring), len_ans (the length you are looking for).
Loop through s. If you have never found the character s[i] before, or you found it at a position before beg (it belonged to some previous substring in s), you increment len_curr. Otherwise you have a repeating character: you compare len_curr against len_ans and assign if needed. Your new beg will be the position after the one where you last found the repeating character. Your new len_curr will be the difference between your current position in s and the position where you last found the repeating character.
You assign i to lut[s[i]], which means that you found the character s[i] for the last time at position i.
You repeat the if clause after the loop ends because your longest substring can be at the end of s.
In // 3 you:
Print len_ans.

c++ count the number of occurrences of each word in string using an array?

I currently have a function that does what the question says (counts the number of occurrences of each word in a string). However, it uses a map. This is for a university-level task and we aren't allowed to use maps for the count (something I didn't read, haha).
void wordCount(std::string wordFile)
{
std::map<std::string, int> M;
std::string word = "";
for (int i = 0; i < str.size(); i++)
{
if (str[i] == ' ')
{
if (M.find(word) == M.end())
{
M.insert(make_pair(word, 1));
word = "";
}
else
{
M[word]++;
word = "";
}
}
else
word += str[i];
}
if (M.find(word) == M.end())
M.insert(make_pair(word, 1));
else
M[word]++;
for (auto &it : M)
{
std::cout << it.first << ": Occurs "
<< it.second
<< std::endl;
}
}
So my question is: is there a way to do the above using arrays and not a map?

Yes, it can be done, but also yes, you normally want to do it somewhat (okay, completely) differently.
I'd do it in phases:
break the input string up into words and create a vector of individual words
and possibly do a bit of massaging, such as converting them all to lower case
Sort the vector of words
Starting from the first word in the list, count the number of following identical words
And add the {word:count} result to your output
Repeat that starting from the first word that mismatched its predecessor
(for a bonus) sort the vector of {word:count} pairs by the count (probably in descending order), so you get a count like:
the: 583
a: 428
an: 422
// ...
dioxide: 2
arthurian: 1
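The phases above can be sketched as follows. This is a minimal sketch, not the answerer's code: the function name countWords and the WordCount alias are made up for illustration, and words are assumed to be separated by whitespace.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using WordCount = std::pair<std::string, int>;  // hypothetical {word, count}

std::vector<WordCount> countWords(const std::string& text) {
    // Phase 1: split on whitespace and convert each word to lower case.
    std::vector<std::string> words;
    std::istringstream in(text);
    for (std::string w; in >> w; ) {
        for (char& c : w)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        words.push_back(w);
    }
    // Phase 2: sorting puts identical words next to each other.
    std::sort(words.begin(), words.end());
    // Phase 3: count each run of identical words.
    std::vector<WordCount> counts;
    for (std::size_t i = 0; i < words.size(); ) {
        std::size_t j = i;
        while (j < words.size() && words[j] == words[i])
            ++j;
        counts.emplace_back(words[i], static_cast<int>(j - i));
        i = j;  // continue from the first word that mismatched
    }
    // Bonus: most frequent first; stable_sort keeps ties in alphabetical order.
    std::stable_sort(counts.begin(), counts.end(),
                     [](const WordCount& a, const WordCount& b) {
                         return a.second > b.second;
                     });
    return counts;
}
```

Only vectors are involved, so this stays within a "no maps" restriction, and the sort dominates the cost at O(n log n) comparisons for n words.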

A good solution since you are forbidden to use maps would simply be to use two vectors, one for the words and another for the number of occurrences of each word, with matching indexes.
When it comes down to it it works pretty much exactly like a map, but since you have that arbitrary restriction that would be the alternative.
To implement this, just replace M by the two vectors, let's call them wordVector and countVector. You can then insert the words like this to make sure the indexes match and the counts are good:
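if (str[i] == ' ')
{
    // std::vector has no find() member; use std::find from <algorithm>
    auto it = std::find(wordVector.begin(), wordVector.end(), word);
    if (it == wordVector.end())
    {
        wordVector.push_back(word);
        countVector.push_back(1);
    }
    else
    {
        // the iterator's offset is the shared index into both vectors
        ++countVector[it - wordVector.begin()];
    }
    word = "";
}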

Checking if items from a particular txt file agree to constraints in c++ - Name That Number USACO

I have got some doubts while solving - Name That Number.
It goes like this -
Among the large Wisconsin cattle ranchers, it is customary to brand cows with serial numbers to please the Accounting Department. The cowhands don't appreciate the advantage of this filing system, though, and wish to call the members of their herd by a pleasing name rather than saying, "C'mon, #4734, get along."
Help the poor cowhands out by writing a program that will translate the brand serial number of a cow into possible names uniquely associated with that serial number. Since the cowhands all have cellular saddle phones these days, use the standard Touch-Tone(R) telephone keypad mapping to get from numbers to letters (except for "Q" and "Z"):
2: A,B,C 5: J,K,L 8: T,U,V
3: D,E,F 6: M,N,O 9: W,X,Y
4: G,H,I 7: P,R,S
Acceptable names for cattle are provided to you in a file named "dict.txt", which contains a list of fewer than 5,000 acceptable cattle names (all letters capitalized). Take a cow's brand number and report which of all the possible words to which that number maps are in the given dictionary which is supplied as dict.txt in the grading environment (and is sorted into ascending order).
For instance, brand number 4734 produces all the following names:
GPDG GPDH GPDI GPEG GPEH GPEI GPFG GPFH GPFI GRDG GRDH GRDI
GREG GREH GREI GRFG GRFH GRFI GSDG GSDH GSDI GSEG GSEH GSEI
GSFG GSFH GSFI HPDG HPDH HPDI HPEG HPEH HPEI HPFG HPFH HPFI
HRDG HRDH HRDI HREG HREH HREI HRFG HRFH HRFI HSDG HSDH HSDI
HSEG HSEH HSEI HSFG HSFH HSFI IPDG IPDH IPDI IPEG IPEH IPEI
IPFG IPFH IPFI IRDG IRDH IRDI IREG IREH IREI IRFG IRFH IRFI
ISDG ISDH ISDI ISEG ISEH ISEI ISFG ISFH ISFI
As it happens, the only one of these 81 names that is in the list of valid names is "GREG".
Write a program that is given the brand number of a cow and prints all the valid names that can be generated from that brand number or ``NONE'' if there are no valid names. Serial numbers can be as many as a dozen digits long.
Here is what I tried to solve this problem. Just go through all the names in the list and check which is satisfying the constraints given.
int numForChar(char c){
if (c=='A'||c=='B'||c=='C') return 2;
else if(c=='D'||c=='E'||c=='F') return 3;
else if(c=='G'||c=='H'||c=='I') return 4;
else if(c=='J'||c=='K'||c=='L') return 5;
else if(c=='M'||c=='N'||c=='O') return 6;
else if(c=='P'||c=='R'||c=='S') return 7;
else if(c=='T'||c=='U'||c=='V') return 8;
else if(c=='W'||c=='X'||c=='Y') return 9;
else return 0;
int main(){
ios::sync_with_stdio(0);
cin.tie(0);
freopen("namenum.in","r",stdin);
freopen("namenum.out","w",stdout);
string S; cin >> S;
int len = S.length();
freopen("dict.txt","r",stdin);
string x;
while(cin >> x){
string currName = x;
if(currName.length() != S.length()) continue;
string newString = x;
for(int i=0;i<len;i++){
//now encode the name as a number according to the rules
int num = numForChar(currName[i]);
currName[i] = (char)num;
}
if(currName == S){
cout << newString << "\n";
}
}
return 0;
}
Unfortunately, when I submit it to the judge, for some reason, it says no output produced that is my program created an empty output file. What's possibly going wrong?
Any help would be much appreciated. Thank You.
UPDATE: I tried what Some Programmer Dude suggested by adding an else return 0; statement at the end of the numForChar function to handle any other character. Unfortunately, it didn't work.

After looking further at the question and exploring the information for Name That Number, I realized that it is not a current contest, just a practice challenge. Thus, I updated my answer and am also giving you my version of a successful submission. That version is a spoiler, though, so it is posted after the explanation of why your code was not working.
First, you forgot a } after the definition of your numForChar() function. Second, you did not print anything when the input fails to yield a valid name. Third, when you call numForChar() on a character of currName, the function returns an integer value. That in itself is not a problem; the problem is that it is the raw number and not the ASCII code of the digit. You then compare it against a character of the input string, which holds the ASCII value of a digit. Thus, your code can never find a match. To fix that, you can add 48 to the return value of numForChar(), or XOR the return value with 48.
You are on the right track with your method, but here are a few hints. If you are bored you can always skip to the spoiler. You don't need the numForChar() function to get a digit from a character. You can use a constant array instead; an array lookup is faster than a long chain of if statements.
For example, you know that A, B and C all yield 2, and that A's ASCII code is 65, B's is 66 and C's is 67. For those three letters you can have an array with indexes 0, 1 and 2 that all store a 2. Thus, if you get B, subtracting A's ASCII code (65) from B's (66) yields 1, and that is the index to read the value from.
To go from a digit to its letters, you can use a two-dimensional array of char instead. Skip the first two indexes, 0 and 1. Each remaining first-level index holds the 3 characters that belong to that digit.
For the dictionary comparison, it is right that we don't need to look at a word if the lengths are unequal. Besides that, since the dictionary words are sorted, we can skip a word if its first letter is lower than the lowest letter the first input digit can produce. On the other hand, once a word's first letter is higher than the highest letter the first input digit can produce, there is no point in continuing the search. Take note that my English in code comments is almost always bad unless I extensively document it.
Your Code(fixed):
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int numForChar(char c){
if (c=='A'||c=='B'||c=='C') return 2;
else if(c=='D'||c=='E'||c=='F') return 3;
else if(c=='G'||c=='H'||c=='I') return 4;
else if(c=='J'||c=='K'||c=='L') return 5;
else if(c=='M'||c=='N'||c=='O') return 6;
else if(c=='P'||c=='R'||c=='S') return 7;
else if(c=='T'||c=='U'||c=='V') return 8;
else if(c=='W'||c=='X'||c=='Y') return 9;
else return 0;
}
int main(){
ios::sync_with_stdio(0);
cin.tie(0);
ifstream fin("namenum.in");
ifstream dict("dict.txt");
ofstream fout("namenum.out");
string S;
fin >> S;
int len = S.length();
bool match = false;
string x;
while(dict >> x){
string currName = x;
if(currName.length() != S.length()) continue;
string newString = x;
for(int i=0;i<len;i++){
//now encode the name as a number according to the rules
int num = numForChar(currName[i]) ^ 48;
currName[i] = (char)num;
}
if(currName == S){
fout << newString << "\n";
match = true;
}
}
if ( match == false ){
fout << "NONE" << endl;
}
return 0;
}
Spoiler Code(Improved):
#include <fstream>
#include <string>
using namespace std;
// A = 65
// 65 - 0 = 65
const char wToN[] = {
// A ,B ,C ,D ,E ,F ,G ,H ,I ,
'2','2','2','3','3','3','4','4','4',
// J ,K ,L ,M ,N ,O ,P ,Q ,R ,S
'5','5','5','6','6','6','7','7','7','7',
// T ,U ,V ,W ,X ,Y ,Z
'8','8','8','9','9','9','9'
};
// 2 = {A, B, C} = 2[0] = A, 2[1] = B, 2[2] C
const char nToW[10][3] = {
{}, // 0 skip
{}, // 1
{'A','B','C'},
{'D','E','F'},
{'G','H','I'},
{'J','K','L'},
{'M','N','O'},
{'P','R','S'},
{'T','U','V'},
{'W','X','Y'}
};
int main(){
ifstream fin("namenum.in");
ifstream dict("dict.txt");
ofstream fout("namenum.out");
string S;
fin >> S;
// Since this will not change
// make this a const to make it
// run faster.
const int len = S.length();
// lastlen is last Index of length
// We calculate this value here,
// So we do not have to calculate
// it for every loop.
const int lastLen = len - 1;
int i = 0;
unsigned char digits[len];
unsigned char firstLetter[3];
// If not match print None
bool match = false;
for ( ; i < len; i++ ){
// No need to check upper bound
// constrain did not call for check.
if ( S[i] < '2' ) {
fout << "NONE" << endl;
return 0;
}
}
const char digit1 = S[0] ^ 48;
// There are 3 set of first letter.
// We get them by converting digits[0]'s
// value using the nToW array.
firstLetter[0] = nToW[digit1][0];
firstLetter[1] = nToW[digit1][1];
firstLetter[2] = nToW[digit1][2];
string dictStr;
while(dict >> dictStr){
// For some reason, when keeping the i = 0 here
// it seem to work faster. That could be because of compiler xor.
i = 0;
// If it is higher than our range
// then there is no point contineuing.
if ( dictStr[0] > firstLetter[2] ) break;
// Skip if first character is lower
// than our range. or If they are not equal in length
if ( dictStr[0] < firstLetter[0] || dictStr.length() != len ) continue;
// If we are in the letter range
// we always check the second letter
// not the first, since we skip the first
i = 1;
for ( int j = 1; j < len; j++ ){
// We convert each letter in the word
// to the corresponding int value
// by subtracting the word ASCII value
// to 65 and use it again our wToN array.
// if it does not match the digits at
// this current position we end the loop.
if ( wToN[dictStr[i] - 65] != S[j] ) break;
// if we get here and there isn't an unmatch then it is a match.
if ( j == lastLen ) {
match = true;
fout << dictStr << endl;
break;
}
i++;
}
}
// No match print none.
if ( match == false ){
fout << "NONE" << endl;
}
return 0;
}

I suggest you use c++ file handling. Overwriting stdin and stdout doesn't seem appropriate.
Add these,
std::ifstream dict ("dict.txt");
std::ofstream fout ("namenum.out");
std::ifstream fin ("namenum.in");
Accordingly change,
cin >> S --to--> fin >> S;
cin >> x --to--> dict >> x
cout << newString --to--> fout << newString

String having maximum number of given substrings made after swapping some characters?

So, this is an interview question that I was going through.
I have strings a, b, and c. I want to obtain string k by swapping some letters in a, so that k should contain as many non-overlapping substrings equal either to b or c as possible. Substring of string x is a string formed by consecutive segment of characters from x. Two substrings of string x overlap if there is position i in string x occupied by both of them.
Input: The first line contains string a, the second line contains string b, and the third line contains string c (1 ≤ |a|, |b|, |c| ≤ 10^5, where |s| denotes the length of string s).
All three strings consist only of lowercase English letters.
It is possible that b and c coincide.
Output: Find one of possible strings k.
Example:
I/P
abbbaaccca
ab
aca
O/P
ababacabcc
This optimal solution has three non-overlapping substrings equal to either b or c, at positions 1-2 (ab), 3-4 (ab) and 5-7 (aca).
Now, the approach that I could think of was to make a character count array for each of the strings, and then proceed from there. Basically, iterate over the original string (a) and check for occurrences of b and c. If there are none, swap as many characters as possible to make either b or c (whichever is shorter). But clearly this is not the optimal approach.
Can anyone suggest something better? (Only pseudocode will be enough)
Thanks!

The first thing you'll need to do is count the number of occurrences of each character in each string. The occurrence counts of a will be your knapsack, which you'll need to fill with as many b's and c's as possible.
Note that when I say knapsack I mean the character count vector of a, and inserting b into a means reducing the character count vector of a by the character count vector of b.
I'm a little short on a mathematical proof, but you'll need to:
1. Insert as many b as possible into the knapsack.
2. Insert as many c as possible into the knapsack (in the space left after step 1).
3. If removing a b from the knapsack enables the insertion of more c, remove a b from the knapsack. Otherwise, finish.
4. Fill the knapsack with as many c as you can.
5. Repeat steps 3-4.
Throughout the program, count the number of b and c in the knapsack. The output should be:
[b_count times b][c_count times c][char_occurrence_left_in_knapsack_for_char_x times char_x for each char_x in lower_case_english]
This should solve your problem in O(n).
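Since the greedy procedure above comes without a proof, a simpler variant of the same character-count idea is to try every possible number of copies of b and, for each, take as many copies of c as the remaining letters allow. The sketch below does that instead of the remove-and-refill loop. The names Counts, countLetters and buildBest are made up for illustration, and the input is assumed to contain only lowercase English letters, as the problem statement says.

```cpp
#include <algorithm>
#include <array>
#include <string>

using Counts = std::array<int, 26>;  // occurrences of 'a'..'z'

Counts countLetters(const std::string& s) {
    Counts c{};
    for (char ch : s)
        ++c[ch - 'a'];
    return c;
}

std::string buildBest(const std::string& a, const std::string& b,
                      const std::string& c) {
    const Counts ca = countLetters(a), cb = countLetters(b), cc = countLetters(c);
    const int n = static_cast<int>(a.size());
    int bestB = 0, bestC = 0;
    for (int nb = 0; nb <= n; ++nb) {
        bool fits = true;  // do nb copies of b fit into the letters of a?
        int nc = n;        // copies of c that fit into what is left
        for (int i = 0; i < 26; ++i) {
            int left = ca[i] - nb * cb[i];
            if (left < 0) { fits = false; break; }
            if (cc[i] > 0)
                nc = std::min(nc, left / cc[i]);
        }
        if (!fits)
            break;         // larger values of nb cannot fit either
        if (nb + nc > bestB + bestC) { bestB = nb; bestC = nc; }
    }
    // Output: the copies of b, the copies of c, then the unused letters.
    std::string k;
    for (int i = 0; i < bestB; ++i) k += b;
    for (int i = 0; i < bestC; ++i) k += c;
    for (int i = 0; i < 26; ++i)
        k.append(ca[i] - bestB * cb[i] - bestC * cc[i],
                 static_cast<char>('a' + i));
    return k;
}
```

For the example from the question (a = abbbaaccca, b = ab, c = aca) this picks two copies of b and one of c. The loop tries at most |a| + 1 values of nb with 26 letters each, so the work stays linear in the input size.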

Assuming that allowed characters have ASCII code 0-127, I would write a function to count the occurrences of each character in a string:
int[] count(String s) {
    int[] res = new int[128]; // Java initializes int arrays to 0
    for (int i = 0; i < s.length(); i++)
        res[s.charAt(i)]++;
    return res;
}
We can now count occurrences in each string:
int[] aCount = count(a);
int[] bCount = count(b);
int[] cCount = count(c);
We can then write a function to count how many times a string can be carved out of characters of another string:
int carveCount(int[] strCount, int[] subStrCount) {
    int min = Integer.MAX_VALUE;
    for (int i = 0; i < subStrCount.length; i++) {
        if (subStrCount[i] == 0)
            continue;
        // each character limits the number of copies to strCount / subStrCount
        min = Math.min(min, strCount[i] / subStrCount[i]);
    }
    // remove the characters used by the carved copies
    for (int i = 0; i < subStrCount.length; i++)
        strCount[i] -= min * subStrCount[i];
    return min;
}
and call the function:
int bFitCount = carveCount(aCount, bCount);
int cFitCount = carveCount(aCount, cCount);
EDIT: I didn't realize you wanted all characters originally in a, fixing here.
Finally, to produce the output:
StringBuilder sb = new StringBuilder();
for (int i = 0; i < bFitCount; i++)
    sb.append(b);
for (int i = 0; i < cFitCount; i++)
    sb.append(c);
for (int i = 0; i < aCount.length; i++) {
    for (int j = 0; j < aCount[i]; j++)
        sb.append((char) i);
}
return sb.toString();
One more comment: if the goal is to maximize the number of repetitions(b)+repetitions(c), then you may want to first swap b and c if c is shorter. This way, if they share some characters, you have a better chance of increasing the result.
The algorithm could be optimized further, but as it is it should have complexity O(n), where n is the sum of the length of the three strings.

A related problem is the knapsack problem.
This is basically the solution described by @Tal Shalti.
I tried to keep everything readable.
My program returns abbcabacac as one of the strings with the most occurrences (3).
To get all permutations without repeating any of them, I use std::next_permutation from <algorithm>. There is not much happening in the main function: I only store the number of occurrences and the permutation when a higher number of occurrences is achieved.
int main()
{
std::string word = "abbbaaccca";
std::string patternSmall = "ab";
std::string patternLarge = "aca";
unsigned int bestOccurrence = 0;
std::string bestPermutation = "";
do {
// count and remove occurrence
unsigned int occurrences = FindOccurrences(word, patternLarge, patternSmall);
if (occurrences > bestOccurrence) {
bestOccurrence = occurrences;
bestPermutation = word;
std::cout << word << " .. " << occurrences << std::endl;
}
} while (std::next_permutation(word.begin(), word.end()));
std::cout << "Best Permutation " << bestPermutation << " with " << bestOccurrence << " occurrences." << std::endl;
return 0;
}
This function handles the basic algorithm. pattern1 is the longer pattern, so it will be searched for last. If a pattern is found, it will be replaced with the string "##", since this should be very rare in the English language.
The variable occurrenceCounter keeps track of the number of occurrences found.
unsigned int FindOccurrences(const std::string& word, const std::string& pattern1, const std::string& pattern2)
{
unsigned int occurrenceCounter = 0;
std::string tmpWord(word);
// '-1' makes implementation of while() easier
std::string::size_type i = -1;
i = -1;
while (FindPattern(tmpWord, pattern2, ++i)) {
occurrenceCounter++;
tmpWord.replace(tmpWord.begin() + i, tmpWord.begin() + i + pattern2.size(), "##");
}
i = -1;
while (FindPattern(tmpWord, pattern1, ++i)) {
occurrenceCounter++;
tmpWord.replace(tmpWord.begin() + i, tmpWord.begin() + i + pattern1.size(), "##");
}
return occurrenceCounter;
}
This function returns the first position of the found pattern. If the pattern is not found, std::string::npos is returned by string.find(...). Also string.find(...) starts to search for the pattern starting by index i.
bool FindPattern(const std::string& word, const std::string& pattern, std::string::size_type& i)
{
std::string::size_type foundPosition = word.find(pattern, i);
if (foundPosition == std::string::npos) {
return false;
}
i = foundPosition;
return true;
}

Finding substring inside string with any order of characters of substring in C/C++

Suppose I have a string "abcdpqrs",
now "dcb" can be counted as a substring of above string as the characters are together.
Also "pdq" is a part of above string. But "bcpq" is not. I hope you got what I want.
Is there any efficient way to do this.
All I can think is taking help of hash to do this. But it is taking long time even in O(n) program as backtracking is required in many cases. Any help will be appreciated.

Here is an O(n * alphabet size) solution:
Let's maintain an array count[a] = how many times the character a occurs in the current window [pos; pos + length of substring - 1]. It can be recomputed in O(1) time when the window is moved by 1 to the right (count[s[pos]]--, count[s[pos + substring length]]++, pos++). Now all we need is to check for each pos that the count array is the same as the count array for the substring (which needs to be computed only once).
It can actually be improved to O(n + alphabet size):
Instead of comparing count arrays in a naive way, we can maintain the number diff = number of characters that do not have the same count value as in the substring for the current window. The key observation is that diff changes in an obvious way when we apply count[c]-- or count[c]++ (it is incremented, decremented or left unchanged depending only on the value of count[c]). Two count arrays are the same if and only if diff is zero for the current pos.
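A sketch of the diff-based version in C++ might look like this. It makes two assumptions that are not in the answer: the alphabet is taken to be all 256 byte values, and a single balance array (window count minus pattern count) stands in for the two separate count arrays. The name findPermutation is made up for illustration.

```cpp
#include <array>
#include <string>

// Returns the first index at which some permutation of pattern occurs as a
// contiguous block of text, or -1 if there is none.
int findPermutation(const std::string& text, const std::string& pattern) {
    const std::size_t n = text.size(), m = pattern.size();
    if (m == 0 || m > n)
        return -1;
    // balance[c] = (count of c in the window) - (count of c in the pattern)
    std::array<int, 256> balance{};
    int diff = 0;  // number of characters whose balance is not zero
    auto add = [&](unsigned char c, int delta) {
        if (balance[c] == 0) ++diff;  // c is about to leave the zero state
        balance[c] += delta;
        if (balance[c] == 0) --diff;  // c has returned to the zero state
    };
    for (unsigned char c : pattern) add(c, -1);
    for (std::size_t i = 0; i < m; ++i) add(text[i], +1);
    for (std::size_t pos = 0; ; ++pos) {
        if (diff == 0)
            return static_cast<int>(pos);  // window matches the pattern's counts
        if (pos + m == n)
            return -1;                     // no further window to try
        add(text[pos], -1);       // character leaving on the left
        add(text[pos + m], +1);   // character entering on the right
    }
}
```

With the strings from the question, findPermutation("abcdpqrs", "dcb") returns 1 and findPermutation("abcdpqrs", "bcpq") returns -1. Every window move touches two array entries, so no backtracking is needed.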

Let's say you have the string "axcdlef" and want to search for "opde":
bool compare (string s1, string s2)
{
    // s1 and s2 are copies, so sorting them does not affect the caller
    // (std::sort needs #include <algorithm>)
    sort(s1.begin(), s1.end());
    sort(s2.begin(), s2.end());
    return s1 == s2;
}
You would need to call this function for this example with the following substrings of size 4 (the same length as "opde"):
"axcd"
"xcdl"
"cdle"
"dlef"
bool exist = false;
// str is the string being searched ("axcdlef"), search is "opde"
for (size_t i = 0; i + search.size() <= str.size() && !exist; i++)
    exist = compare(str.substr(i, search.size()), search);

You can use a regex library (e.g. Boost or Qt) for this. Alternatively, you can use this simple approach. You know the length k of the string s to be searched for in string str. So take each run of k consecutive characters from str and check whether every character of s is present in it.
Starting point ( a naive implementation to make further optimizations):
#include <iostream>
#include <string>
/* pos position where to extract probable string from str
* s string set with possible repetitions being searched in str
* str original string
*/
bool find_in_string( int pos, std::string s, std::string str)
{
std::string str_s = str.substr( pos, s.length());
int s_pos = 0;
while( !s.empty())
{
std::size_t found = str_s.find( s[0]);
if ( found!=std::string::npos)
{
s.erase( 0, 1);
str_s.erase( found, 1);
} else return 0;
}
return 1;
}
bool find_in_string( std::string s, std::string str)
{
bool found = false;
int pos = 0;
while( !found && pos < str.length() - s.length() + 1)
{
found = find_in_string( pos++, s, str);
}
return found;
}
Usage:
int main() {
std::string s1 = "abcdpqrs";
std::string s2 = "adcbpqrs";
std::string searched = "dcb";
std::string searched2 = "pdq";
std::string searched3 = "bcpq";
std::cout << find_in_string( searched, s1);
std::cout << find_in_string( searched, s2);
std::cout << find_in_string( searched2, s1);
std::cout << find_in_string( searched3, s1);
return 0;
}
prints: 1110
http://ideone.com/WrSMeV

To use an array for this you are going to need some extra code to map where each character goes in there... Unless you know you are only using 'a' - 'z' or something similar that you can simply subtract from 'a' to get the position.
bool compare(string s1, string s2)
{
    // SIZE_OF_ALPHABET must be at least the number of distinct characters
    int v1[SIZE_OF_ALPHABET] = {0};
    int v2[SIZE_OF_ALPHABET] = {0};
    int count = 0;
    map<char, int> mymap;
    // give every distinct letter of s1 its own position
    for (char letter : s1)
        if (mymap.find(letter) == mymap.end())
            mymap[letter] = count++;
    // a letter of s2 that is missing from the map means that the second
    // string has a different character, so the strings cannot match
    for (char letter : s2)
        if (mymap.find(letter) == mymap.end())
            return false;
    // count now holds the number of distinct chars,
    // so only 'count' positions of the arrays are in use
    for (char letter : s1)
        v1[mymap[letter]]++;
    for (char letter : s2)
        v2[mymap[letter]]++;
    for (int i = 0; i < count; i++)
        if (v1[i] != v2[i])
            return false;
    return true;
}

Here is an O(m) best case, O(m!) worst case solution, m being the length of your search string:
Use a suffix trie, e.g. one built with Ukkonen's algorithm (there are some floating around, but I have no link at hand at the moment), and search for any permutation of the substring. Note that any lookup needs just O(1) for each character of the string to search, regardless of the size of n.
However, while the size of n does not matter, this becomes impractical for large m.
If, however, n is small enough and one is willing to sacrifice index size for lookup performance, the suffix trie can store a string that contains all permutations of the original string.
Then the lookup will always be O(m).
I'd suggest going with the accepted answer for the general case. However, here you have a suggestion that can perform (much) better for small substrings and a large string.
