I need to compute the longest common substrings from a set of filenames in C++.
More precisely, I have a std::list of std::strings (the Qt equivalent would also be fine):
char const *x[] = {"FirstFileWord.xls", "SecondFileBlue.xls", "ThirdFileWhite.xls", "ForthFileGreen.xls"};
std::list<std::string> files(x, x + sizeof(x) / sizeof(*x));
I need to compute the n distinct longest common substrings of all strings, in this case e.g. for n=2
"File" and ".xls"
If I could compute the longest common substring (LCS), I could cut it out and run the algorithm again to get the second longest, so essentially this boils down to:
Is there a (reference?) implementation for computing the LCS of a std::list of std::strings?
What I have so far is not a good answer but a dirty solution: brute force on a QList of QUrls, from which only the part after the last "/" is taken. I'd love to replace this with "proper" code.
(I have discovered http://www.icir.org/christian/libstree/, which would help greatly, but I can't get it to compile on my machine. Has anyone used it?)
QString SubstringMatching::getMatchPattern(QList<QUrl> urls)
{
QString a;
int foundPosition = -1;
int foundLength = -1;
for (int i=urls.first().toString().lastIndexOf("/")+1; i<urls.first().toString().length(); i++)
{
bool hit=true;
int xj;
for (int j=0; j<urls.first().toString().length()-i+1; j++ ) // try to match from position i up to the end of the string :: test character at pos. (i+j)
{
if (!hit) break;
QString firstString = urls.first().toString().right( urls.first().toString().length()-i ).left( j ); // this needs to match all k strings
//qDebug() << "SEARCH " << firstString;
for (int k=1; k<urls.length(); k++) // test all other strings, k = test string number
{
if (!hit) break;
//qDebug() << " IN " << urls.at(k).toString().right(urls.at(k).toString().length() - urls.at(k).toString().lastIndexOf("/")+1);
//qDebug() << " RES " << urls.at(k).toString().indexOf(firstString, urls.at(k).toString().lastIndexOf("/")+1);
if (urls.at(k).toString().indexOf(firstString, urls.at(k).toString().lastIndexOf("/")+1)<0) {
xj = j;
//qDebug() << "HIT LENGTH " << xj-1 << " : " << firstString;
hit = false;
}
}
}
if (hit) xj = urls.first().toString().length()-i+1; // hit up to the end of the string
if ((xj-2)>foundLength) // have longer match than existing, j=1 is match length
{
foundPosition = i; // at the current position
foundLength = xj-1;
//qDebug() << "Found at " << i << " length " << foundLength;
}
}
a = urls.first().toString().right( urls.first().toString().length()-foundPosition ).left( foundLength );
//qDebug() << a;
return a;
}
If as you say suffix trees are too heavyweight or otherwise impractical, the following
fairly simple brute-force approach may be adequate for your application.
I assume distinct substrings shall be non-overlapping and are picked from
left to right.
Even with these assumptions, there need not be a unique set that comprises
"the N distinct longest common substrings" of a set of strings. Whatever N is,
there might be more than N distinct common substrings all of the same maximal
length and any choice of N from among them would be arbitrary. Accordingly
the solution finds the at-most N *sets* of the longest distinct common
substrings in which all those of the same length are one set.
The algorithm is as follows:
Q is the target quota of lengths.
Strings is the problem set of strings.
Results is an initially empty multimap that maps a length to a set of strings,
Results[l] being the set with length l
N, initially 0, is the number of distinct lengths represented in Results
If Q is 0 or Strings is empty return Results
Find any shortest member of Strings; keep a copy of it S and remove it
from Strings. We proceed by comparing the substrings of S with those
of Strings because all the common substrings of {Strings, S} must be
substrings of S.
Iteratively generate all the substrings of S, longest first, using the
obvious nested loop controlled by offset and length. For each substring ss of
S:
If ss is not a common substring of Strings, next.
Iterate over Results[l] for l >= the length of ss until end of
Results or until ss is found to be a substring of the examined
result. In the latter case, ss is not distinct from a result already
in hand, so next.
ss is a common substring distinct from any already in hand. Iterate over
Results[l] for l < the length of ss, deleting each result that is a
substring of ss, because all those are shorter than ss and not distinct
from it. ss is now a common substring distinct from any already in hand and
all others that remain in hand are distinct from ss.
For l = the length of ss, check whether Results[l] exists, i.e. if
there are any results in hand the same length as ss. If not, call that
a NewLength condition.
Check also if N == Q, i.e. we have already reached the target quota of distinct
lengths. If NewLength obtains and also N == Q, call that a StickOrRaise condition.
If StickOrRaise obtains then compare the length of ss with l = the
length of the shortest results in hand. If ss is shorter than l
then it is too short for our quota, so next. If ss is longer than l
then all the shortest results in hand are to be ousted in favour of ss, so delete
Results[l] and decrement N.
Insert ss into Results keyed by its length.
If NewLength obtains, increment N.
Abandon the inner iteration over substrings of S that have the
same offset of ss but are shorter, because none of them are distinct
from ss.
Advance the offset in S for the outer iteration by the length of ss,
to the start of the next non-overlapping substring.
Return Results.
Here is a program that implements the solution and demonstrates it with
a list of strings:
#include <list>
#include <map>
#include <string>
#include <iostream>
#include <algorithm>
using namespace std;
// Get a non-const iterator to the shortest string in a list
list<string>::iterator shortest_of(list<string> & strings)
{
auto where = strings.end();
size_t min_len = size_t(-1);
for (auto i = strings.begin(); i != strings.end(); ++i) {
if (i->size() < min_len) {
where = i;
min_len = i->size();
}
}
return where;
}
// Say whether a string is a common substring of a list of strings
bool
is_common_substring_of(
string const & candidate, list<string> const & strings)
{
for (string const & s : strings) {
if (s.find(candidate) == string::npos) {
return false;
}
}
return true;
}
/* Get a multimap whose keys are the at-most `quota` greatest
lengths of common substrings of the list of strings `strings`, each key
multi-mapped to the set of common substrings of that length.
*/
multimap<size_t,string>
n_longest_common_substring_sets(list<string> & strings, unsigned quota)
{
size_t nlengths = 0;
multimap<size_t,string> results;
if (quota == 0) {
return results;
}
auto shortest_i = shortest_of(strings);
if (shortest_i == strings.end()) {
return results;
}
string shortest = *shortest_i;
strings.erase(shortest_i);
for ( size_t start = 0; start < shortest.size();) {
size_t skip = 1;
for (size_t len = shortest.size(); len > 0; --len) {
string subs = shortest.substr(start,len);
if (!is_common_substring_of(subs,strings)) {
continue;
}
auto i = results.lower_bound(subs.size());
for ( ;i != results.end() &&
i->second.find(subs) == string::npos; ++i) {}
if (i != results.end()) {
continue;
}
for (i = results.begin();
i != results.end() && i->first < subs.size(); ) {
if (subs.find(i->second) != string::npos) {
i = results.erase(i);
} else {
++i;
}
}
auto hint = results.lower_bound(subs.size());
bool new_len = hint == results.end() || hint->first != subs.size();
if (new_len && nlengths == quota) {
size_t min_len = results.begin()->first;
if (min_len > subs.size()) {
continue;
}
results.erase(min_len);
--nlengths;
}
nlengths += new_len;
results.emplace_hint(hint,subs.size(),subs);
len = 1;
skip = subs.size();
}
start += skip;
}
return results;
}
// Testing ...
int main()
{
list<string> strings{
"OfBitWordFirstFileWordZ.xls",
"SecondZWordBitWordOfFileBlue.xls",
"ThirdFileZBitWordWhiteOfWord.xls",
"WordFourthWordFileBitGreenZOf.xls"};
auto results = n_longest_common_substring_sets(strings,4);
for (auto const & val : results) {
cout << "length: " << val.first
<< ", substring: " << val.second << endl;
}
return 0;
}
Output:
length: 1, substring: Z
length: 2, substring: Of
length: 3, substring: Bit
length: 4, substring: .xls
length: 4, substring: File
length: 4, substring: Word
(Built with gcc 4.8.1)
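For the simpler case of a single longest common substring, the same brute-force idea (only substrings of the shortest string can be common) can be sketched as below. This is a minimal illustration, not part of the program above: `longest_common_substring` is a hypothetical helper name, and it takes a `std::vector` rather than a `std::list` purely for brevity.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns one longest substring shared by every string
// in `strings`, or "" if the list is empty or nothing is shared.
std::string longest_common_substring(const std::vector<std::string>& strings)
{
    if (strings.empty()) return "";
    // Every common substring must occur in the shortest string.
    const std::string& shortest = *std::min_element(
        strings.begin(), strings.end(),
        [](const std::string& a, const std::string& b) { return a.size() < b.size(); });
    // Try lengths from longest to shortest; the first hit is a longest one.
    for (std::size_t len = shortest.size(); len > 0; --len) {
        for (std::size_t start = 0; start + len <= shortest.size(); ++start) {
            const std::string candidate = shortest.substr(start, len);
            const bool common = std::all_of(
                strings.begin(), strings.end(),
                [&](const std::string& s) { return s.find(candidate) != std::string::npos; });
            if (common) return candidate;
        }
    }
    return "";
}
```

When several common substrings share the maximal length, this sketch returns the leftmost one in the shortest string, which is exactly the arbitrariness discussed above.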
I am trying to find the maximum number of words in a sentence (sentences are separated by a dot) in a paragraph, and I am completely stuck on how to sort and output to stdout.
Eg:
Given a string S: {"Program to split strings. By using custom split function. In C++"};
The expected output should be : 5
#define max 8 // define the max string
string strings[max]; // define max string
string words[max];
int count = 0;
void split (string str, char seperator) // custom split() function
{
int currIndex = 0, i = 0;
int startIndex = 0, endIndex = 0;
while (i <= str.size())
{
if (str[i] == seperator || i == str.size())
{
endIndex = i;
string subStr = "";
subStr.append(str, startIndex, endIndex - startIndex);
strings[currIndex] = subStr;
currIndex += 1;
startIndex = endIndex + 1;
}
i++;
}
}
void countWords(string str) // Count The words
{
int count = 0, i;
for (i = 0; str[i] != '\0';i++)
{
if (str[i] == ' ')
count++;
}
cout << "\n- Number of words in the string are: " << count +1 <<" -";
}
//Sort the array in descending order by the number of words
void sortByWordNumber(int num[30])
{
/* CODE str::sort? std::*/
}
int main()
{
string str = "Program to split strings. By using custom split function. In C++";
char seperator = '.'; // dot
int numberOfWords;
split(str, seperator);
cout <<" The split string is: ";
for (int i = 0; i < max; i++)
{
cout << "\n initial array index: " << i << " " << strings[i];
countWords(strings[i]);
}
return 0;
}
count + 1 in countWords() gives the correct number only for the first result; for the later ones the leading " " whitespace is added to the word count.
Please consider answering with the easiest solution to understand first (std::sort, making a new function, a lambda).
Your code does not make sense. For example, the meaning of this declaration
string strings[max];
is unclear.
And to find the maximum number of words in sentences of a paragraph there is no need to sort the sentences themselves by the number of words.
If I have understood correctly what you need is something like the following.
#include <iostream>
#include <string>
#include <sstream>
#include <iterator>
int main()
{
std::string s;
std::cout << "Enter a paragraph of sentences: ";
std::getline( std::cin, s );
size_t max_words = 0;
std::istringstream is( s );
std::string sentence;
while ( std::getline( is, sentence, '.' ) )
{
std::istringstream iss( sentence );
auto n = std::distance( std::istream_iterator<std::string>( iss ),
std::istream_iterator<std::string>() );
if ( max_words < n ) max_words = n;
}
std::cout << "The maximum number of words in sentences is "
<< max_words << '\n';
return 0;
}
If you enter the paragraph
Here is a paragraph. It contains several sentences. For example, how to use string streams.
then the output will be
The maximum number of words in sentences is 7
If you are not yet familiar with string streams then you could use member functions find, find_first_of, find_first_not_of with objects of the type std::string to split a string into sentences and to count words in a sentence.
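A minimal sketch of that stream-free alternative is shown below. The function names `count_words` and `max_words_per_sentence` are hypothetical, and the sketch assumes that words are separated by spaces, tabs or newlines and that every '.' ends a sentence.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Count the words that start inside text[begin, end).
std::size_t count_words(const std::string& text, std::size_t begin, std::size_t end)
{
    const char* blanks = " \t\n";
    std::size_t words = 0;
    std::size_t pos = text.find_first_not_of(blanks, begin);   // start of first word
    while (pos != std::string::npos && pos < end) {
        ++words;
        pos = text.find_first_of(blanks, pos);                  // end of this word
        if (pos == std::string::npos) break;
        pos = text.find_first_not_of(blanks, pos);              // start of next word
    }
    return words;
}

// Split the paragraph at '.' with find and keep the largest word count.
std::size_t max_words_per_sentence(const std::string& paragraph)
{
    std::size_t best = 0;
    std::size_t begin = 0;
    while (begin < paragraph.size()) {
        std::size_t end = paragraph.find('.', begin);
        if (end == std::string::npos) end = paragraph.size();   // last sentence without a dot
        best = std::max(best, count_words(paragraph, begin, end));
        begin = end + 1;
    }
    return best;
}
```

For the paragraph entered above, `max_words_per_sentence` also yields 7.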
Your use case sounds like a reduction. Essentially, you can have a state machine (parser) that goes through the string and updates some state (e.g. counters) when it encounters the word and sentence delimiters. Special care should be given to corner cases, e.g. multiple consecutive whitespace characters or more than one consecutive full stop (.). A reduction handling these cases is shown below:
#include <numeric>
#include <string>
#include <utility>
int max_words_in(std::string const& str)
{
// p is the current and max word count.
auto parser = [in_space = false] (std::pair<int, int> p, char c) mutable {
switch (c) {
case '.': // Sentence ends.
if (!in_space && p.second <= p.first) p.second = p.first + 1;
p.first = 0;
in_space = true;
break;
case ' ': // Word ends.
if (!in_space) ++p.first;
in_space = true;
break;
default: // Other character encountered.
in_space = false;
}
return p; // Return the updated accumulation value.
};
return std::accumulate(
str.begin(), str.end(), std::make_pair(0, 0), parser).second;
}
The tricky part is deciding how to handle degenerate cases, e.g. what should the output be for "This is a , ,tricky .. .. string to count" where different types of delimiters alternate in arbitrary ways. Having a state machine implementation of the parsing logic allows you to easily adjust your solution (e.g. you can pass an "ignore list" to the parser and update the default case to not reset the in_space variable when c belongs to that list).
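The "ignore list" idea could look roughly like the sketch below. It is a restructured variant rather than the lambda shown earlier: the names `WordState` and `max_words_ignoring` are hypothetical, the state is carried in a small struct instead of a pair plus a captured flag, and a final sentence without a full stop is counted as well.

```cpp
#include <algorithm>
#include <numeric>
#include <string>

struct WordState {
    int current = 0;       // words finished in the current sentence
    int best = 0;          // largest sentence word count seen so far
    bool in_word = false;  // true while inside a word
};

// Characters listed in `ignored` neither start a word nor end one.
int max_words_ignoring(const std::string& str, const std::string& ignored)
{
    auto step = [&ignored](WordState s, char c) {
        if (c == '.' || c == ' ') {
            if (s.in_word) ++s.current;    // a delimiter closes the open word
            s.in_word = false;
            if (c == '.') {                // a full stop also closes the sentence
                s.best = std::max(s.best, s.current);
                s.current = 0;
            }
        } else if (ignored.find(c) == std::string::npos) {
            s.in_word = true;              // ordinary character: inside a word
        }                                  // ignored characters change nothing
        return s;
    };
    WordState end = std::accumulate(str.begin(), str.end(), WordState{}, step);
    if (end.in_word) ++end.current;        // flush a final sentence without '.'
    return std::max(end.best, end.current);
}
```

With "," in the ignore list, the stray commas in the tricky string above no longer count as words; with an empty ignore list they do.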
vector<string> split(const string& str, char seperator) // custom split() function
{
    vector<string> sentences;
    size_t start = 0; // index where the current sentence begins
    for (size_t i = 0; i < str.size(); i++)
    {
        if (str[i] == seperator)
        {
            // keep the separator as part of the sentence
            sentences.push_back(str.substr(start, i + 1 - start));
            start = i + 1;
        }
    }
    if (start < str.size())
    {
        // trailing text that is not terminated by a separator
        sentences.push_back(str.substr(start));
    }
    return sentences;
}
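Once the sentences are in a vector, sorting them by word count, as the question asks, could be sketched with std::sort and a lambda. The helper names `word_count` and `sort_by_word_count` are hypothetical, and words are assumed to be whitespace-separated.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Number of whitespace-separated words in one sentence.
std::size_t word_count(const std::string& sentence)
{
    std::istringstream in(sentence);
    return static_cast<std::size_t>(std::distance(
        std::istream_iterator<std::string>(in), std::istream_iterator<std::string>()));
}

// Sort sentences in descending order of their word count.
void sort_by_word_count(std::vector<std::string>& sentences)
{
    std::sort(sentences.begin(), sentences.end(),
              [](const std::string& a, const std::string& b) {
                  return word_count(a) > word_count(b);
              });
}
```

After sorting, the first element is a sentence with the most words, so `word_count(sentences.front())` gives the maximum. If only the maximum is needed, a single pass with std::max avoids the sort entirely.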
I'm trying to write a C++ function to lexicographically compare the kth word of two strings. Here is my function:
bool kth_lexo ()
{
int k = 2 ;
str1 = "123 300 60009" ;
str2 = "1500 10002" ;
// to store the kth word of fist string in ptr1
char *ptr1 = strtok( (char*)str1.c_str() ," ");
for(int i = 1; i<k; i++)
{
ptr1 = strtok(NULL," ");
}
// to store the kth word of second string in ptr2
char *ptr2 = strtok( (char*)str2.c_str() ," ");
for(int i = 1; i<k; i++)
{
ptr2 = strtok(NULL," ");
}
string st1 = ptr1 ;
string st2 = ptr2 ;
return st1 > st2 ;
}
In this function my lexicographical comparison works fine: the function returns 1 because 300 (2nd word of str1) is lexicographically bigger than 10002 (2nd word of str2).
My problem:
If I slightly modify my function by replacing the last line with return ptr1>ptr2 ;
my new function looks like this:
bool kth_lexo ()
{
int k = 2 ;
str1 = "123 300 60009" ;
str2 = "1500 10002" ;
// to store the kth word of fist string in ptr1
char *ptr1 = strtok( (char*)str1.c_str() ," ");
for(int i = 1; i<k; i++)
{
ptr1 = strtok(NULL," ");
}
// to store the kth word of second string in ptr2
char *ptr2 = strtok( (char*)str2.c_str() ," ");
for(int i = 1; i<k; i++)
{
ptr2 = strtok(NULL," ");
}
// modified line compared to previous function
return ptr1 > ptr2 ;
}
For this modified function my output is consistently 0, no matter whether the kth word of str1 stored in ptr1 is lexicographically greater or smaller than the kth word of str2 stored in ptr2.
Changing the return statement to this line doesn't help much either:
return (*ptr1)>(*ptr2) ;
So what's the problem with either of these two return statement lines in my modified function for comparing the kth word of both the strings:
return ptr1 > ptr2 ;
OR
return (*ptr1) > (*ptr2) ;
Your program is written in a very C-like style. Modern C++ makes this much simpler and easier to read, since it offers very expressive syntax:
#include <string_view>
#include <iostream>
#include <cassert>
auto find_kth_char(std::string_view to_search, char c, std::size_t k, std::size_t pos = 0) {
for (; pos < std::string_view::npos && k > 0; --k) {
pos = to_search.find(c, pos + 1);
}
return pos;
}
auto get_kth_word(std::string_view to_search, std::size_t k) {
// We count starting on 1
assert(k > 0);
auto start = find_kth_char(to_search, ' ', k - 1);
if (start == std::string_view::npos) {
return std::string_view{};
}
if (k > 1) {
++start; // step past the delimiter so the word has no leading space
}
auto end = find_kth_char(to_search, ' ', 1, start);
return to_search.substr(start, end - start);
}
auto compare_kth(std::string_view lhs, std::string_view rhs, std::size_t k) {
auto l_word = get_kth_word(lhs, k);
auto r_word = get_kth_word(rhs, k);
// returnvalue <=> 0 == lhs <=> rhs
return l_word.compare(r_word);
}
int main() {
auto str1 = "123 300 60009";
auto str2 = "1500 10002";
for (std::size_t k = 1; k < 4; ++k) {
std::cout << k << ":\t" << compare_kth(str1, str2, k) << '\n';
}
}
I am using C++17's string_view since we do not change anything in the strings and taking substrings etc. is very cheap with them. We use the find and compare member functions for doing the real work.
The return value of our function is an int that tells us whether the left-hand side is smaller (negative result), equal (0) or greater (positive result) than the right-hand side.
If you stopped using C and consistently used C++, this problem would not occur.
You are mixing C++ std::string with char* or const char* here. For strings, std::string is so superior to old-style C char arrays or char* that from now on you should use std::string instead.
A char pointer is the address of some area in memory where your char data is stored. Dereferencing the pointer with * gives you the element stored at this address: exactly one character, not a string.
Comparing ptr1 > ptr2 does not compare strings. It compares the addresses at which the strings are stored in memory. ptr1 could be 0x578962574 and ptr2 could be 0x95324782, or whatever. We do not know the addresses; they depend on where each std::string keeps its buffer at run time.
And if you compare (*ptr1)>(*ptr2), then you compare only the first characters of the two words, which can also give you the wrong result.
Comparing two std::strings, on the other hand, always works as expected.
So, simple answer: Use std::string for all strings.
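As an illustration of that advice, the kth word can be extracted with a string stream and compared as a std::string, with no pointers involved. This is only a sketch: `kth_word` and `kth_word_greater` are hypothetical names, k is 1-based, and a missing word is treated as an empty string.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Returns the k-th (1-based) whitespace-separated word of `text`,
// or an empty string if there are fewer than k words.
std::string kth_word(const std::string& text, std::size_t k)
{
    std::istringstream in(text);
    std::string word;
    for (std::size_t i = 0; i < k; ++i) {
        if (!(in >> word)) return "";   // ran out of words
    }
    return word;
}

// True if the k-th word of `a` is lexicographically greater than that of `b`.
bool kth_word_greater(const std::string& a, const std::string& b, std::size_t k)
{
    return kth_word(a, k) > kth_word(b, k);   // std::string comparison, not pointer comparison
}
```

For the strings in the question, `kth_word_greater("123 300 60009", "1500 10002", 2)` is true, because "300" compares greater than "10002".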
The aim of the function is to find the longest substring without repeating characters, so I need to find the start position of the substring and its length. The thing I'm struggling with is that the time complexity should be O(n), so I cannot use nested for loops to check whether each letter is repeated.
I created a struct and a function like this, but I don't know how to continue:
struct Answer {
int start;
int length;
};
Answer findsubstring(char *string){
Answer sub={0, 0};
for (int i = 0; i < strlen(string); i++) {
}
return sub;
}
For example, if the input is HelloWorld, the output should be World, with length 5.
If the input is abagkfleoKi, then the output is bagkfleoKi, with length 10.
Also, if two substrings have the same length, pick the latter one.
Use a std::unordered_map<char, size_t> to store the index just past the last occurrence of each char.
Keep the best match so far as well as the match you are currently testing. Iterating through the chars of the input results in two cases you need to handle:
The char has already occurred, and its last occurrence lies inside the current match, so you must move the start of the potential match to keep the char from occurring twice: first update the answer with the match ending just before the current char, if that's better than the current answer.
Otherwise: just update the map.
#include <iostream>
#include <string_view>
#include <unordered_map>

// Uses the Answer struct from the question.
void printsubstring(const char* input)
{
    // char -> index just past its last occurrence
    std::unordered_map<char, size_t> lastOccurrences;
    Answer answer{ 0, 0 };
    size_t currentPos = 0;
    size_t currentStringStart = 0;
    char c;
    while ((c = input[currentPos]) != 0)
    {
        auto entry = lastOccurrences.insert({ c, currentPos + 1 });
        if (!entry.second)
        {
            if (currentStringStart < entry.first->second)
            {
                // the char repeats inside the current match, which therefore ends
                // just before the current char; keep it if it is at least as long
                // as the best so far (ties go to the later substring)
                if (currentPos - currentStringStart >= static_cast<size_t>(answer.length))
                {
                    answer.start = static_cast<int>(currentStringStart);
                    answer.length = static_cast<int>(currentPos - currentStringStart);
                }
                // always move the start past the previous occurrence
                currentStringStart = entry.first->second;
            }
            entry.first->second = currentPos + 1;
        }
        ++currentPos;
    }
    // check the match ending at the end of the string
    if (currentPos - currentStringStart >= static_cast<size_t>(answer.length))
    {
        answer.start = static_cast<int>(currentStringStart);
        answer.length = static_cast<int>(currentPos - currentStringStart);
    }
    std::cout << answer.start << ", " << answer.length << std::endl;
    std::cout << std::string_view(input + answer.start, answer.length) << std::endl;
}
I'll outline one possible solution.
1. You'll need two loops: one pointing at the start of the substring and one pointing at the end.
auto stringlen = std::strlen(string);
for(size_t beg = 0; beg < stringlen - sub.length; ++beg) {
// See point 2.
for(size_t end = beg; end < stringlen; ++end) {
// See point 3.
}
}
2. Create a "blacklist" of characters already seen in the substring.
bool blacklist[1 << CHAR_BIT]{}; // zero initialized
3. Check if the current end character is already in the blacklist and break out of the loop if it is; otherwise, put it in the blacklist.
if(blacklist[ static_cast<unsigned char>(string[end]) ]) break;
else {
blacklist[ static_cast<unsigned char>(string[end]) ] = true;
// See point 4.
}
4. Check if the length of the current substring (end - beg + 1) is greater than the longest you currently have (sub.length). If it is longer, store sub.start = beg and sub.length = end - beg + 1.
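Put together, the steps of this outline might look like the sketch below. It reuses the `Answer` struct and the `findsubstring` name from the question, takes a `const char*` (an assumption, since the function does not modify the input), and keeps the strict "greater than" comparison of the last step, so among equally long candidates it keeps the earliest one.

```cpp
#include <climits>
#include <cstddef>
#include <cstring>

struct Answer {
    int start;
    int length;
};

Answer findsubstring(const char* string)
{
    Answer sub = {0, 0};
    const std::size_t stringlen = std::strlen(string);
    // beg marks the start of a candidate; stop once nothing longer
    // than the current best can fit in the rest of the string.
    for (std::size_t beg = 0; beg + sub.length < stringlen; ++beg) {
        bool blacklist[1 << CHAR_BIT]{};   // fresh, zero-initialized for each start
        for (std::size_t end = beg; end < stringlen; ++end) {
            const unsigned char c = static_cast<unsigned char>(string[end]);
            if (blacklist[c]) break;       // character repeats: candidate ends here
            blacklist[c] = true;
            const std::size_t len = end - beg + 1;
            if (len > static_cast<std::size_t>(sub.length)) {
                sub.start = static_cast<int>(beg);
                sub.length = static_cast<int>(len);
            }
        }
    }
    return sub;
}
```

The inner loop can run at most `1 << CHAR_BIT` times before some character repeats, so the total work is bounded by a constant times the input length.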
I am timing different algorithms; however, my brute-force implementation, which I have found numerous times on different sites, sometimes returns more results than, say, the Notepad++ or VSCode search. I'm not sure what I am doing wrong.
The program opens a txt file with a DNA strand string of length 10000000 and searches and counts the number of occurrences of the string passed in via command line.
Algorithm:
int main(int argc, char *argv[]) {
// read in dna strand
ifstream file("dna.txt");
string dna((istreambuf_iterator<char>(file)), istreambuf_iterator<char>());
dna.c_str();
int dnaLength = dna.length();
cout << "DNA Strand Length: " << dnaLength << endl;
string pat = argv[1];
cout << "Pattern: " << pat << endl;
// algorithm
int M = pat.length();
int N = dnaLength;
int localCount = 0;
for (int i = 0; i <= N - M; i++) {
int j;
for (j = 0; j < M; j++) {
if (dna.at(i + j) != pat.at(j)) {
break;
}
}
if (j == M) {
localCount++;
}
}
The difference might be because your algorithm also counts overlapping results, while a quick check with Notepad++ shows that it does not.
Example:
Let dna be "FooFooFooFoo"
And your pattern "FooFoo"
What result do you expect? Notepad++ shows 2 (one starts at position 1, the second at position 7, after the first).
Your algorithm will find 3 (positions 1, 4 and 7).
In your algorithm, the index i increases by 1 on every iteration. This can cause double counting for some search patterns. For example, search for ABAB in the text ... ABABABABABAB .... Your method may report 5 occurrences, whereas there are only 3 if no character may be counted twice. Which answer do you want?
To avoid double counting, you can turn the loop over i into a while loop:
int i = 0, j;
while (i <= N - M) // stop once the pattern no longer fits
{
for (j = 0; j < M; j++) {
if (dna.at(i + j) != pat.at(j)) {
break;
}
}
if (j == M) {
localCount++;
i += M;
}
else ++i;
}
Or you can use the function std::string::find(const std::string& str, size_type pos = 0). The first argument is the pattern to look for, and the second is the position at which to start the search:
std::string::size_type pos = dna.find(pat); // initial search starts from pos = 0
int count = 0;
while (pos != std::string::npos) { // while a match was found
++count;
pos = dna.find(pat, pos + M); // continue the search just past this match
}
Running both methods lets you cross-check the results.
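To see both counting conventions side by side, a small helper along these lines may be useful. It is a sketch only: `count_occurrences` and its `allow_overlap` flag are hypothetical names, and an empty pattern is defined to occur 0 times to avoid an endless loop.

```cpp
#include <cstddef>
#include <string>

// Counts occurrences of `pat` in `text`. With allow_overlap the search resumes
// one character after each match, otherwise just past the end of the match.
std::size_t count_occurrences(const std::string& text, const std::string& pat,
                              bool allow_overlap)
{
    if (pat.empty()) return 0;
    const std::size_t step = allow_overlap ? 1 : pat.size();
    std::size_t count = 0;
    for (std::size_t pos = text.find(pat); pos != std::string::npos;
         pos = text.find(pat, pos + step)) {
        ++count;
    }
    return count;
}
```

For "FooFooFooFoo" and the pattern "FooFoo" this gives 3 with overlaps and 2 without, matching the difference between the brute-force loop and the editor search described above.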
Given a string, what's the most optimized solution to find the maximum number of equal substrings it is composed of? For example, "aaaa" is composed of four equal substrings "a", and "abab" is composed of two "ab"s. But for something like "abcd" there isn't any substring other than "abcd" itself that, when concatenated with itself, would make up "abcd".
Checking all the possible substrings isn't a solution, since the input can be a string of length 1 million.
Since there is no given condition for the substrings, an optimized solution to find the maximum number of equal substrings is to count the shortest possible strings: single letters. Create a map and count the letters of the string. Find the letter with the maximum count. That is your solution.
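A minimal sketch of that letter-counting idea follows; `max_letter_count` is a hypothetical name, and an empty string yields 0.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>

// Returns how often the most frequent character occurs in `s`.
std::size_t max_letter_count(const std::string& s)
{
    std::map<char, std::size_t> counts;
    std::size_t best = 0;
    for (char c : s) {
        best = std::max(best, ++counts[c]);  // update the count, track the maximum
    }
    return best;
}
```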
EDIT:
If the string must only consist of the substrings then the following code computes a solution
#include <iostream>
#include <string>
using ull = unsigned long long;
int main() {
std::string str = "abab";
ull length = str.length();
for (ull i = 1; (2 * i) <= str.length(); ++i) {
if (str.length() % i != 0) continue; // the head length must divide the string length
bool found = true;
for (ull j = 1; (j * i) < str.length(); ++j) {
for (ull k = 0; k < i; ++k) {
if(str[k] != str[k + j * i]) {
found = false;
}
}
}
if(found) {
length = i;
break;
}
}
std::cout << "Maximal number: " << str.length() / length << std::endl;
return 0;
}
This algorithm checks if the head of the string is repeated and if the string only consists of repetitions of the head.
i-loop iterates over the length of the head,
j-loop iterates over each repetition,
k-loop iterates over each character in the substring
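For inputs as long as a million characters, a linear-time alternative is possible. The sketch below uses a different technique from the loops above, namely the prefix function known from the Knuth-Morris-Pratt algorithm: if the string has length n and its longest proper border has length b, the smallest period is n - b, and the string consists of whole repetitions exactly when that period divides n. The name `max_equal_pieces` is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the largest k such that `s` is k copies of one substring
// (1 if no shorter repeating unit exists, 0 for an empty string).
std::size_t max_equal_pieces(const std::string& s)
{
    const std::size_t n = s.size();
    if (n == 0) return 0;
    // pi[i] = length of the longest proper prefix of s[0..i] that is also its suffix
    std::vector<std::size_t> pi(n, 0);
    for (std::size_t i = 1; i < n; ++i) {
        std::size_t k = pi[i - 1];
        while (k > 0 && s[i] != s[k]) k = pi[k - 1];
        if (s[i] == s[k]) ++k;
        pi[i] = k;
    }
    const std::size_t period = n - pi[n - 1];   // smallest period of s
    return (n % period == 0) ? n / period : 1;
}
```

This runs in O(n) time and uses O(n) extra memory for the prefix table.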