Suppose I have a collection of substrings, for example:
string a = {"cat","sensitive","ate","energy","tense"}
Then the output for this should be as follows:
catensesensitivenergy
How can I do this?
This problem is known as the shortest common superstring problem and it is NP-hard, so if you need an exact solution you cannot do much better than trying all possibilities and choosing the best one.
One possible exponential solution is to generate all permutations of the input strings, find the shortest common superstring greedily for each permutation (a permutation specifies the order of the strings, and it is possible to prove that for a fixed order the greedy algorithm always works correctly), and choose the best one.
Using user2040251's suggestion:
#include <string>
#include <vector>
#include <iostream>
#include <algorithm>

std::string merge_strings( const std::vector< std::string > & pool )
{
    std::string retval;
    for( const auto & s : pool )
        if( retval.empty() )
            retval.append( s );
        else if( std::search( retval.begin(), retval.end(), s.begin(), s.end() ) == retval.end() )
        {
            // find the longest suffix of retval that is a prefix of s
            size_t len = std::min( retval.size(), s.size() );
            for( ; len; --len )
                if( retval.substr( retval.size() - len ) == s.substr( 0, len ) )
                {
                    retval.append( s.substr( len ) );
                    break;
                }
            if( !len )
                retval.append( s );
        }
    return retval;
}

std::string shortest_common_supersequence( std::vector< std::string > & pool )
{
    std::sort( pool.begin(), pool.end() );
    std::string buffer;
    std::string best_reduction = merge_strings( pool );
    while( std::next_permutation( pool.begin(), pool.end() ) )
    {
        buffer = merge_strings( pool );
        if( buffer.size() < best_reduction.size() )
            best_reduction = buffer;
    }
    return best_reduction;
}

int main( int argc, char ** argv )
{
    std::vector< std::string > a{"cat","sensitive","ate","energy","tense"};
    std::vector< std::string > b{"cat","sensitive","ate","energy","tense","sit"};
    std::vector< std::string > c{"personal","ate","energy","tense","gyroscope"};
    std::cout << "best a --> \"" << shortest_common_supersequence( a ) << "\"\n";
    std::cout << "best b --> \"" << shortest_common_supersequence( b ) << "\"\n";
    std::cout << "best c --> \"" << shortest_common_supersequence( c ) << "\"\n";
    return 0;
}
Output:
best a --> "catensensitivenergy"
best b --> "catensensitivenergy"
best c --> "atensenergyroscopersonal"
Let's break the problem down, starting with only two strings: we must check which suffix of one string is the longest prefix of the other. This gives us the order for the best concatenation.
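For instance, here is a minimal sketch of that two-string step (the helper names overlap and best_merge are just illustrative):

#include <algorithm>
#include <iostream>
#include <string>

// Length of the longest suffix of `a` that is also a prefix of `b`.
size_t overlap( const std::string & a, const std::string & b )
{
    size_t len = std::min( a.size(), b.size() );
    for ( ; len > 0; --len )
        if ( a.compare( a.size() - len, len, b, 0, len ) == 0 )
            return len;
    return 0;
}

// Best concatenation of two strings: try both orders, keep the shorter result.
std::string best_merge( const std::string & a, const std::string & b )
{
    std::string ab = a + b.substr( overlap( a, b ) );
    std::string ba = b + a.substr( overlap( b, a ) );
    return ab.size() <= ba.size() ? ab : ba;
}

int main()
{
    std::cout << best_merge( "tense", "sensitive" ) << "\n"; // prints "tensensitive"
}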
Now, with a set of n words, how do we proceed? We start by building a trie containing every word (one key per word). If a word is a duplicate of another, we can easily flag it as such while building the prefix tree.
I made a quick implementation of a regular Trie. You can find it here.
We now have the tools to build a directed graph linking the different words whenever a suffix of the first is a prefix of the second. The weight of the edge is the length of the second word minus the length of the overlap (made precise below).
To do so, for each word w of the input set, we must see which words we can reach with a suffix of w (a naive sketch of this step follows below):
- We walk down the trie using the suffix. We will end up in a node (or not).
- From this node, provided it exists, we scan the remaining subtree to see which words are available.
- If a given suffix of length l yields a match with a prefix of word w', then we add an edge w → w', with weight length(w') - l.
- If such an edge already exists, we just update the weight to keep the lowest.
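Here is a naive sketch of this edge-building step, done without the trie (so O(n²·L²) rather than the trie-assisted walk described above; the function name is illustrative):

#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// For each ordered pair (w, w'), find the longest suffix of w that is a
// prefix of w' and record an edge of weight length(w') - l, i.e. the
// characters w' would add to the superstring.
std::map< std::pair<size_t, size_t>, size_t >
build_overlap_graph( const std::vector<std::string> & words )
{
    std::map< std::pair<size_t, size_t>, size_t > edges;
    for ( size_t i = 0; i < words.size(); ++i )
        for ( size_t j = 0; j < words.size(); ++j )
        {
            if ( i == j )
                continue;
            const std::string & w = words[i];
            const std::string & v = words[j];
            size_t l = std::min( w.size(), v.size() );
            for ( ; l > 0; --l ) // longest overlap first
                if ( w.compare( w.size() - l, l, v, 0, l ) == 0 )
                    break;
            if ( l > 0 )
                edges[{i, j}] = v.size() - l; // longest overlap = lowest weight
        }
    return edges;
}

int main()
{
    std::vector<std::string> words{ "cat", "sensitive", "ate", "energy", "tense" };
    for ( const auto & e : build_overlap_graph( words ) )
        std::cout << words[e.first.first] << " -> " << words[e.first.second]
                  << " (weight " << e.second << ")\n";
}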
From there, the graph is set and we must find the shortest path that runs through every vertex (i.e. word) exactly once. If the graph is complete, this is the Traveling Salesman Problem. Most of the time, the graph won't be complete.
Still, it remains an NP-hard problem. In more "technical" terms, the problem at hand is to find the shortest Hamiltonian path of a digraph.
Note: Given a Hamiltonian path (if it exists) with cost C and starting vertex (word) W, the superstring length is given by:
L_super = L_W + C
For example, the path cat → ate → tense → sensitive → energy has edge weights 1, 3, 7 and 5, so C = 16, L_W = 3, and L_super = 19, the length of "catensensitivenergy".
Note: If two words have no suffix linking them to another word, then the graph is not connected and there is no Hamiltonian path.
My current problem is the following:
I have a std::vector of full path names to files.
Now I want to cut off the common prefix of all strings.
Example
If I have these 3 strings in the vector:
/home/user/foo.txt
/home/user/bar.txt
/home/baz.txt
I would like to cut off /home/ from every string in the vector.
Question
Is there any method to achieve this in general?
I want an algorithm that drops the common prefix of all strings.
I currently only have an idea which solves this problem in O(n m), where n is the number of strings and m is the length of the longest string, by just comparing every string against every other string character by character.
Is there a faster or more elegant way solving this?
This can be done entirely with std:: algorithms.
Synopsis:
- Sort the input range if not already sorted. The first and last paths in the sorted range will be the most dissimilar. Best case is O(N), worst case O(N + N.logN).
- Use std::mismatch to determine the largest common sequence between the two most dissimilar paths [insignificant].
- Run through each path, erasing the first COUNT characters, where COUNT is the number of characters in the longest common sequence. O(N).

Best case time complexity: O(2N), worst case O(2N + N.logN) (can someone check that?)
#include <iostream>
#include <algorithm>
#include <string>
#include <vector>

std::string common_substring(const std::string& l, const std::string& r)
{
    return std::string(l.begin(),
                       std::mismatch(l.begin(), l.end(),
                                     r.begin(), r.end()).first);
}

std::string mutating_common_substring(std::vector<std::string>& range)
{
    if (range.empty())
        return std::string();
    else
    {
        if (not std::is_sorted(range.begin(), range.end()))
            std::sort(range.begin(), range.end());
        return common_substring(range.front(), range.back());
    }
}

std::vector<std::string> chop(std::vector<std::string> samples)
{
    auto str = mutating_common_substring(samples);
    for (auto& s : samples)
    {
        s.erase(s.begin(), std::next(s.begin(), str.size()));
    }
    return samples;
}

int main()
{
    std::vector<std::string> samples = {
        "/home/user/foo.txt",
        "/home/user/bar.txt",
        "/home/baz.txt"
    };
    samples = chop(std::move(samples));
    for (auto& s : samples)
    {
        std::cout << s << std::endl;
    }
}
expected:
baz.txt
user/bar.txt
user/foo.txt
Here's an alternate common_substring which does not require a sort. Its time complexity is in theory O(N), but whether it's faster in practice you'd have to check:
#include <numeric> // for std::accumulate

std::string common_substring(const std::vector<std::string>& range)
{
    if (range.empty())
    {
        return {};
    }
    return std::accumulate(std::next(range.begin(), 1), range.end(), range.front(),
                           [](auto const& best, const auto& sample)
                           {
                               return common_substring(best, sample);
                           });
}
update:
Elegance aside, this is probably the fastest way since it avoids any memory allocations, performing all transformations in-place. For most architectures and sample sizes, this will matter more than any other performance consideration.
#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>
#include <string>

void reduce_to_common(std::string& best, const std::string& sample)
{
    best.erase(std::mismatch(best.begin(), best.end(),
                             sample.begin(), sample.end()).first,
               best.end());
}

void remove_common_prefix(std::vector<std::string>& range)
{
    if (range.size())
    {
        auto iter = range.begin();
        auto best = *iter;
        for ( ; ++iter != range.end() ; )
        {
            reduce_to_common(best, *iter);
        }
        auto prefix_length = best.size();
        for (auto& s : range)
        {
            s.erase(s.begin(), std::next(s.begin(), prefix_length));
        }
    }
}

int main()
{
    std::vector<std::string> samples = {
        "/home/user/foo.txt",
        "/home/user/bar.txt",
        "/home/baz.txt"
    };
    remove_common_prefix(samples);
    for (auto& s : samples)
    {
        std::cout << s << std::endl;
    }
}
You have to search every string in the list. However you don't need to compare all the characters in every string. The common prefix can only get shorter, so you only need to compare with "the common prefix so far". I don't think this changes the big-O complexity - but it will make quite a difference to the actual speed.
Also, these look like file names. Are they sorted (bearing in mind that many filesystems tend to return things in sorted order)? If so, you only need to consider the first and last elements. If they are probably or mostly ordered, then consider the common prefix of the first and last, and then iterate through all the other strings, shortening the prefix further as necessary.
You just have to iterate over every string. You can only avoid needlessly iterating over the full length of the strings by exploiting the fact that the prefix can only get shorter:
#include <iostream>
#include <string>
#include <vector>

std::string common_prefix(const std::vector<std::string> &ss) {
    if (ss.empty())
        // no prefix
        return "";
    std::string prefix = ss[0];
    for (size_t i = 1; i < ss.size(); i++) {
        size_t c = 0; // index after which the strings differ
        for (; c < prefix.length(); c++) {
            // note: if ss[i] is shorter than prefix, ss[i][c] at c == ss[i].size()
            // is the terminating '\0', which breaks the loop before going out of bounds
            if (prefix[c] != ss[i][c]) {
                // strings differ from character c on
                break;
            }
        }
        if (c == 0)
            // no common prefix
            return "";
        // the prefix is only up to character c-1, so resize prefix
        prefix.resize(c);
    }
    return prefix;
}

void strip_common_prefix(std::vector<std::string> &ss) {
    std::string prefix = common_prefix(ss);
    if (prefix.empty())
        // no common prefix, nothing to do
        return;
    // drop the common part, which is always the first prefix.length() characters
    for (std::string &s: ss) {
        s = s.substr(prefix.length());
    }
}

int main()
{
    std::vector<std::string> ss { "/home/user/foo.txt", "/home/user/bar.txt", "/home/baz.txt"};
    strip_common_prefix(ss);
    for (std::string &s: ss)
        std::cout << s << "\n";
}
Drawing from the hints of Martin Bonner's answer, you may implement a more efficient algorithm if you have more prior knowledge on your input.
In particular, if you know your input is sorted, it suffices to compare the first and last strings (see Richard's answer).
i - Find the file which has the least folder depth (i.e. baz.txt); its root path is /home.
ii - Then go through the other strings to see if they start with that root.
iii - If so, then remove the root from all the strings.
Start with std::size_t index = 0;. Scan the list to see if the characters at that index match (note: past the end does not match). If they do, advance index and repeat.
When done, index will have the value of the length of the prefix.
At this point, I'd advise you to write or find a string_view type. If you do, simply create a string_view for each of your strings str spanning from index to str.size().
Overall cost: O(|prefix|*N+N), which is also the cost to confirm that your answer is correct.
If you don't want to write a string_view, simply call str.erase(str.begin(), str.begin()+index) on each str in your vector.
Overall cost is O(|total string length| + N). The prefix has to be visited in order to confirm it, then the tail of each string has to be rewritten.
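Here is a minimal sketch of this approach, assuming C++17 for std::string_view (the function name drop_common_prefix is illustrative):

#include <iostream>
#include <string>
#include <string_view>
#include <vector>

std::vector<std::string_view> drop_common_prefix(const std::vector<std::string>& ss)
{
    std::size_t index = 0;
    if (!ss.empty()) {
        bool all_match = true;
        while (all_match) {
            for (const std::string& s : ss)
                if (index >= s.size() || s[index] != ss[0][index]) {
                    all_match = false; // past-the-end counts as a mismatch
                    break;
                }
            if (all_match)
                ++index; // every string agrees at this index
        }
    }
    // index is now the length of the common prefix; hand out views of the tails
    std::vector<std::string_view> tails;
    for (const std::string& s : ss)
        tails.emplace_back(std::string_view(s).substr(index));
    return tails;
}

int main()
{
    std::vector<std::string> samples = {
        "/home/user/foo.txt", "/home/user/bar.txt", "/home/baz.txt"
    };
    for (std::string_view v : drop_common_prefix(samples))
        std::cout << v << "\n";
}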
Now the cost of this breadth-first scan is poor memory locality, as you are touching memory all over the place. It will probably be more efficient in practice to do it in chunks: scan the first K strings up to length Q and find their common prefix, then chain that common prefix plus the next block. This won't change the O-notation, but will improve locality of memory reference.
// note: 6 is hardcoded here as the length of the common prefix "/home/"
for (vector<string>::iterator itr = V.begin(); itr != V.end(); ++itr)
    itr->erase(0, 6);
I want to check for a word contained within a bigger string, but not necessarily in the same order. Example: The program will check if the word "car" exists in "crqijfnsa". In this case, it does, because the second string contains c, a, and r.
You could build a map containing the letters of "car", with the values set to 0. Cycle through the array with all the letters, and if a letter is in the word "car", change its value to 1. If all the keys in the map have a value greater than 0, then the word can be constructed. Try implementing this.
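A sketch of that idea, using counts rather than 0/1 flags so that repeated letters in the word are also handled (names are illustrative):

#include <iostream>
#include <map>
#include <string>

bool can_build(const std::string& word, const std::string& letters)
{
    std::map<char, int> needed;
    for (char c : word)
        ++needed[c]; // how many of each letter we still need
    for (char c : letters) {
        auto it = needed.find(c);
        if (it != needed.end() && it->second > 0)
            --it->second; // one required letter found
    }
    for (const auto& kv : needed)
        if (kv.second > 0)
            return false; // some letter was never matched
    return true;
}

int main()
{
    std::cout << can_build("car", "crqijfnsa") << "\n"; // prints 1
}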
An anagram is a type of word play, the result of rearranging the letters of a word or phrase to produce a new word or phrase, using all the original letters exactly once.
So, actually what you are looking for is an algorithm to check whether two words are "anagrams" or not.
The following thread provides pseudocode that might be helpful:
Finding anagrams for a given word
A very primitive version would be something like this:
for (std::string::iterator it = str.begin(); it != str.end(); ++it)
    for (std::string::iterator it2 = str2.begin(); it2 != str2.end(); ++it2) {
        if (*it == *it2) {
            str2.erase(it2); // remove the matched character from str2
            break;
        }
    }
if (str2.empty())
    found = true;
You could build up a table of count of characters of each letter in the word you are searching for, then decrement those counts as you work through the search string.
#include <array>   // std::array
#include <numeric> // std::accumulate

bool IsWordInString(const char* word, const char* str)
{
    // build up table of characters in word to match
    std::array<int, 256> cword = {0};
    for (; *word; ++word) {
        cword[static_cast<unsigned char>(*word)]++;
    }
    // work through str matching characters in word
    for (; *str; ++str) {
        unsigned char c = static_cast<unsigned char>(*str);
        if (cword[c] > 0) {
            cword[c]--;
        }
    }
    // every character was matched if all counts are back to zero
    return std::accumulate(cword.begin(), cword.end(), 0) == 0;
}
It's also possible to return as soon as you find a match, but the code isn't as simple.
bool IsWordInString(const char* word, const char* str)
{
    // empty string
    if (*word == 0)
        return true;
    // build up table of characters in word to match
    int unmatched = 0;
    int cword[256] = {0};
    for (; *word; ++word) {
        cword[static_cast<unsigned char>(*word)]++;
        unmatched++;
    }
    // work through str matching characters in word
    for (; *str; ++str) {
        unsigned char c = static_cast<unsigned char>(*str);
        if (cword[c] > 0) {
            cword[c]--;
            unmatched--;
            if (unmatched == 0)
                return true;
        }
    }
    return false;
}
Some test cases
"" in "crqijfnsa" => 1
"car" in "crqijfnsa" => 1
"ccar" in "crqijfnsa" => 0
"ccar" in "crqijfnsac" => 1
I think the easiest (and probably fastest, test that yourself :) ) implementation would be done with std::includes:
std::string testword {"car"};
std::string testarray {"crqijfnsa"};
std::sort(testword.begin(), testword.end());
std::sort(testarray.begin(), testarray.end());
bool is_in_array = std::includes(testarray.begin(), testarray.end(),
                                 testword.begin(), testword.end());
This also handles all cases of duplicate letters correctly.
The complexity of this approach should be O(n log n), where n is the length of testarray (sort is O(n log n) and includes has linear complexity).
I am trying to produce an efficient implementation in C++ for the following problem:
I have two blobs (const char *data, size_t length); I call them "blob1" and "blob2". Now I would like to get the longest prefix of "blob2" in "blob1". If the longest prefix occurs multiple times in "blob1", I would like to get the one with the biggest index.
Here is an example (the blobs here are just ASCII strings, so the example is easier to read):
blob1 = HELLO LOOO HELOO LOOO LOO JU
blob2 = LOOO TUS
The following are all valid prefixes of blob2 (note the prefix with the trailing space):
{ "L", "LO", "LOO", "LOOO", "LOOO ", "LOOO T", "LOOO TU", "LOOO TUS" }
The longest prefix of blob2 in blob1 is LOOO. It is there twice:
HELLO *LOOO* HELOO *LOOO* LOO JU
So I would like to get the index of the second one, which would be 6, and the length of the prefix, which would be 4.
Unfortunately blob1 and blob2 change many times, so it would probably be too slow to create a tree or some other complex structure.
Do you know a good algorithm to solve this problem?
I don't know if this is the best algorithm to solve this (I'm sure it is not), but I guess it is a good one. The idea is simple: start by searching for the shortest token from blob2 in blob1. When you find a match, try to see if you can match bigger tokens at this position. If you can, update your token length.
Continue your search from where you last stopped, but this time searching for a token of the updated length from blob2. When you find a match, again try to match bigger tokens at this position and update the token length if you succeed. Repeat this procedure until the end of your buffer.
Below is a simple complete program showing an implementation.
#include <algorithm>
#include <cstring>
#include <iostream>
#include <vector>

/////////////////////0123456789012345678901234567
const char str1[] = "HELLO LOOO HELOO LOOO LOO JU";
const char str2[] = "LOOO TUS";

int main()
{
    std::vector<char> blob1(str1, str1 + std::strlen(str1));
    std::vector<char> blob2(str2, str2 + std::strlen(str2));

    auto next = blob1.begin();
    auto tokenLength = 1;
    auto position = -1;

    while ( std::next(next, tokenLength) < blob1.end() ) {
        // search for the current token (the first tokenLength bytes of blob2)
        auto current = std::search(next,
                                   blob1.end(),
                                   blob2.begin(),
                                   std::next(blob2.begin(), tokenLength));
        if ( current == blob1.end() )
            break;
        position = std::distance(blob1.begin(), current);
        next = std::next(current, 1);
        // try to extend the match, one byte at a time, staying inside both buffers
        for (auto i = tokenLength;
             std::next(blob2.begin(), i) < blob2.end() && std::next(current, i) < blob1.end();
             ++i) {
            auto x = std::search(std::next(current, i),
                                 std::next(current, i + 1),
                                 std::next(blob2.begin(), i),
                                 std::next(blob2.begin(), i + 1));
            if ( x != std::next(current, i) )
                break;
            ++tokenLength;
        }
    }
    std::cout << "Index: " << position << ", length: " << tokenLength << std::endl;
}
New to using boost. Using it to load a collection of images. The issue is that the number of images in the folder will continue to grow, and I will eventually not want to add all of them to my display program. I am on OS X and using C++.
How can I adjust this example code to only load, say, 30 images from the top or bottom of the directory? Loading only the newest files would be awesome, but I would settle for just changing this. Unfortunately, just saying (it < 30) in my loop doesn't work, because it is compared against fs::directory_iterator.
Example code:
fs::path pPhoto( photobooth_texture_path );
for ( fs::directory_iterator it( pPhoto ); it != fs::directory_iterator(); ++it )
{
    if ( fs::is_regular_file( *it ) )
    {
        // -- Perhaps there is a better way to ignore hidden files
        string photoFileName = it->path().filename().string();
        if( !( photoFileName.compare( ".DS_Store" ) == 0 ) )
        {
            photoboothTex.push_back( gl::Texture( loadImage( photobooth_texture_path + photoFileName ), mipFmt) );
            cout << "Loaded: " << photoFileName << endl;
        }
    }
}
EDIT: This is how I ended up doing it. Sort of a hybrid of the two methods, but I needed to sort backwards, even though it won't necessarily be predictably going backwards... taking my chances. Not the cleanest thing in the world, but I had to translate their ideas to a flavor of C++ I understood.
vector<string> fileList;
int count = 0;
photoboothTex.clear(); // clear this out to make way for new photos

fs::path pPhoto( photobooth_texture_path );
for ( fs::directory_iterator it( pPhoto ); it != fs::directory_iterator(); ++it ) {
    if ( fs::is_regular_file( *it ) )
    {
        // -- Perhaps there is a better way to ignore hidden files
        string photoFileName = it->path().filename().string();
        if( !( photoFileName.compare( ".DS_Store" ) == 0 ) )
        {
            fileList.push_back(photoFileName);
        }
    }
}

// walk the list backwards, loading at most 40 images
for (int i = (int)fileList.size() - 1; i >= 0; i--) {
    photoboothTex.push_back( gl::Texture( loadImage( photobooth_texture_path + fileList[i] )) );
    cout << "Loaded Photobooth: " << fileList[i] << endl;
    if (++count == 40) break; // loads a maximum of 40 images
}
Here's a working example that uses boost::filter_iterator with directory_iterator to store paths to regular files in a vector. I sorted the vector based on last_write_time(). I also omitted error checking for brevity; this example will crash if there are fewer than 30 files in the directory.
#include <iostream>
#include <vector>
#include <algorithm>
#include <iterator>
#include <boost/filesystem.hpp>
#include <boost/iterator/filter_iterator.hpp>

namespace fs = boost::filesystem;

int main()
{
    fs::path p("My image directory");
    fs::directory_iterator dir_first(p), dir_last;

    std::vector<fs::path> files;
    auto pred = [](const fs::directory_entry& p)
    {
        return fs::is_regular_file(p);
    };
    std::copy(boost::make_filter_iterator(pred, dir_first, dir_last),
              boost::make_filter_iterator(pred, dir_last, dir_last),
              std::back_inserter(files));

    std::sort(files.begin(), files.end(),
              [](const fs::path& p1, const fs::path& p2)
              {
                  return fs::last_write_time(p1) < fs::last_write_time(p2);
              });

    std::copy_n(files.begin(), 30, std::ostream_iterator<fs::path>(std::cout, "\n"));
}
To make your example work, you could structure the for-loop like this:
fs::path pPhoto( photobooth_texture_path );
fs::directory_iterator it( pPhoto );
for ( size_t i = 0; i < 30 && it != fs::directory_iterator(); ++it )
{
    if ( fs::is_regular_file( *it ) )
    {
        // load the image
        ++i;
    }
}
Obviously you can't just say it < 30 because 30 isn't a directory_iterator.
And, even if you could, that would only count the first 30 files period, not the first 30 non-hidden files, which I suspect isn't what you want (especially since the usual *nix rule for "hidden" is "starts with '.'", and those files tend to come first).
But you can easily keep track of the count yourself:
int count = 0;
fs::path pPhoto( photobooth_texture_path );
for ( fs::directory_iterator it( pPhoto ); it != fs::directory_iterator(); ++it )
{
    if ( fs::is_regular_file( *it ) )
    {
        // -- Perhaps there is a better way to ignore hidden files
        string photoFileName = it->path().filename().string();
        if( !( photoFileName.compare( ".DS_Store" ) == 0 ) )
        {
            photoboothTex.push_back( gl::Texture( loadImage( photobooth_texture_path + photoFileName ), mipFmt) );
            cout << "Loaded: " << photoFileName << endl;
            if (++count == 30) break;
        }
    }
}
That's it.
Loading only the newest files would be awesome, but I would settle for just changing this.
This does not get the newest 30, just "some 30". boost::filesystem iterates "as if by calling POSIX readdir_r()", and readdir_r iterates over a directory stream which is specified as "an ordered sequence of all the directory entries in a particular directory", but there's no way to tell it what ordering you want for that sequence.
Of course you can add ordering yourself by reading in the whole list, then sorting however you want. See jrok's answer above for that. But there are some down-sides to this:
It's not as simple.
It's going to be much slower if you have large directories (because you have to read in, and possibly stat, all 3000 entries to sort them, instead of just reading 30).
It's going to take much more memory.
Ultimately, it's a tradeoff.
While it's not as simple, someone (jrok) has already written the code, and understanding his code is a worthwhile learning experience anyway. While it's "much slower", it may still be "more than fast enough". While it takes "much more memory", it's likely still just a drop in the bucket. But you have to evaluate those factors and decide for yourself.
I'll mention two other quick things:
First, if speed is not an issue but memory is (very unlikely, but not completely impossible), you can make the code a bit more complex by keeping only the newest 30 files found so far, instead of all of them. (For example, stick them in a set instead of a vector; for each new value, if it's older than the oldest value in the set, ignore it; otherwise, insert it into the set and remove the oldest value.)
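A sketch of that set-based idea with boost::filesystem (N and the directory name are made up for the example):

#include <ctime>
#include <iostream>
#include <set>
#include <utility>
#include <boost/filesystem.hpp>

namespace fs = boost::filesystem;

int main()
{
    const std::size_t N = 30;
    fs::path dir("My image directory");
    // ordered by write time, oldest of the kept files first
    std::set<std::pair<std::time_t, fs::path>> newest;
    for (fs::directory_iterator it(dir), end; it != end; ++it) {
        if (!fs::is_regular_file(*it))
            continue;
        std::time_t t = fs::last_write_time(it->path());
        if (newest.size() < N) {
            newest.insert({t, it->path()});
        } else if (t > newest.begin()->first) {
            newest.erase(newest.begin()); // drop the oldest of the kept set
            newest.insert({t, it->path()});
        }
    }
    for (const auto& e : newest)
        std::cout << e.second << "\n";
}

This keeps memory at O(N) no matter how large the directory grows.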
Second, if you don't care about portability, and are willing to trade from boost::filesystem to some ugly, platform-specific, C-based API, your platform might have a way to read directory entries in sorted order. But I wouldn't pursue this unless you really do need both ordering and efficiency, so much that you're willing to completely sacrifice portability and simplicity.
I know that hashing an infinite number of strings into 32-bit ints must generate collisions, but I expect a nice distribution from a hashing function.
Isn't it weird that these 2 strings have the same hash?
size_t hash0 = std::hash<std::string>()("generated_id_0");
size_t hash1 = std::hash<std::string>()("generated_id_1");
//hash0 == hash1
I know I can use boost::hash<std::string> or others, but I want to know what is wrong with std::hash. Am I using it wrong? Shouldn't I somehow "seed" it?
There's nothing wrong with your usage of std::hash. The problem is that the specialization std::hash<std::string> provided by the standard library implementation bundled with Visual Studio 2010 only takes a subset of the string's characters to determine the hash value (presumably for performance reasons). Coincidentally the last character of a string with 14 characters is not part of this set, which is why both strings yield the same hash value.
As far as I know this behaviour is in conformance with the standard, which demands only that multiple calls to the hash function with the same argument must always return the same value. However, the probability of a hash collision should be minimal. The VS2010 implementation fulfills the mandatory part, yet fails to account for the optional one.
For details, see the implementation in the header file xfunctional (starting at line 869 in my copy) and §17.6.3.4 of the C++ standard (latest public draft).
If you absolutely need a better hash function for strings, you should implement it yourself. It's actually not that hard.
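To see how such a sampling hash collides, here is a toy reconstruction (this is NOT the actual VC10 code, just an illustration of the stride-sampling idea; the stride of 1 + size/10 is the one reported in the next answer):

#include <iostream>
#include <string>

// Only every (1 + size/10)-th character feeds the hash.
size_t sampling_hash(const std::string& s)
{
    size_t stride = 1 + s.size() / 10;
    size_t h = 2166136261U;
    for (size_t i = 0; i < s.size(); i += stride)
        h = 16777619 * h ^ static_cast<unsigned char>(s[i]);
    return h;
}

int main()
{
    // Both strings have 14 characters, so stride = 2 and the sampled
    // indices are 0, 2, ..., 12; the differing last character (index 13)
    // is never looked at, hence the collision.
    std::cout << (sampling_hash("generated_id_0") ==
                  sampling_hash("generated_id_1")) << "\n"; // prints 1
}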
The exact hash algorithm isn't specified by the standard, so the results will vary. The algorithm used by VC10 doesn't seem to take all of the characters into account if the string is longer than 10 characters; it advances with an increment of 1 + s.size() / 10. This is legal, albeit from a QoI point of view rather disappointing; such hash codes are known to perform very poorly for some typical sets of data (like URLs). I'd strongly suggest you replace it with either an FNV hash or one based on a Mersenne prime:
FNV hash:
struct hash
{
    size_t operator()( std::string const& s ) const
    {
        size_t result = 2166136261U ;
        std::string::const_iterator end = s.end() ;
        for ( std::string::const_iterator iter = s.begin() ;
              iter != end ;
              ++ iter ) {
            result = (16777619 * result)
                     ^ static_cast< unsigned char >( *iter ) ;
        }
        return result ;
    }
};
Mersenne prime hash:
struct hash
{
    size_t operator()( std::string const& s ) const
    {
        size_t result = 2166136261U ;
        std::string::const_iterator end = s.end() ;
        for ( std::string::const_iterator iter = s.begin() ;
              iter != end ;
              ++ iter ) {
            result = 127 * result
                     + static_cast< unsigned char >( *iter ) ;
        }
        return result ;
    }
};
(The FNV hash is supposedly better, but the Mersenne prime hash will be faster on a lot of machines, because multiplying by 127 is often significantly faster than multiplying by 16777619.)
You should likely get different hash values; I do get different values (GCC 4.5):
hashtest.cpp
#include <string>
#include <iostream>
#include <functional>

int main(int argc, char** argv)
{
    size_t hash0 = std::hash<std::string>()("generated_id_0");
    size_t hash1 = std::hash<std::string>()("generated_id_1");
    std::cout << hash0 << (hash0 == hash1 ? " == " : " != ") << hash1 << "\n";
    return 0;
}
Output
# g++ hashtest.cpp -o hashtest -std=gnu++0x
# ./hashtest
16797002355621538189 != 16797001256109909978
You do not seed a hashing function; at most you can salt it.
The function is used in the right way, and this collision could be just coincidental.
You cannot tell whether the hashing function is unevenly distributed unless you perform a massive test with random keys.
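For example, a quick (illustrative) empirical check is to hash many similar keys and count how many distinct values come out:

#include <functional>
#include <iostream>
#include <set>
#include <string>

int main()
{
    std::set<size_t> values;
    const int n = 100000;
    for (int i = 0; i < n; ++i)
        values.insert(std::hash<std::string>()("generated_id_" + std::to_string(i)));
    std::cout << n - values.size() << " collisions among " << n << " keys\n";
}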
The TR1 hash function and the newest standard define proper overloads for things like strings. When I run this code using std::tr1::hash (g++ 4.1.2), I get different hash values for these two strings.