Removing specified characters from a string - Efficient methods (time and space complexity) - c++

Here is the problem: Remove specified characters from a given string.
Input: The string is "Hello World!" and characters to be deleted are "lor"
Output: "He Wd!"
Solving this involves two sub-parts:
Determining if the given character is to be deleted
If so, then deleting the character
To solve the first part, I am reading the characters to be deleted into a std::unordered_map, i.e. I parse the string "lor" and insert each character into the hashmap. Later, when I am parsing the main string, I will look into this hashmap with each character as the key and if the returned value is non-zero, then I delete the character from the string.
Question 1: Is this the best approach?
Question 2: Which would be better for this problem, std::map or std::unordered_map? Since I am not interested in ordering, I used an unordered_map. But is there a higher overhead for creating the hash table? What should I do in such situations: use a map (balanced tree) or an unordered_map (hash table)?
Now coming to the next part, i.e. deleting the characters from the string. One approach is to delete the character and shift the data from that point on back by one position. In the worst case, where we have to delete all the characters, this would take O(n^2).
The second approach would be to copy only the required characters to another buffer. This would involve allocating enough memory to hold the original string and copying over character by character, leaving out the ones that are to be deleted. Although this requires additional memory, this would be an O(n) operation.
The third approach would be to start reading and writing from the 0th position, incrementing the source pointer every time I read and incrementing the destination pointer only when I write. Since the source pointer will always be at or ahead of the destination pointer, I can write over the same buffer. This saves memory and is also an O(n) operation. This is what I am doing, and I call resize at the end to drop the leftover characters.
Here is the function I have written:
// str contains the string (Hello World!)
// chars contains the characters to be deleted (lor)
void remove_chars(string& str, const string& chars)
{
unordered_map<char, int> chars_map;
for(string::size_type i = 0; i < chars.size(); ++i)
chars_map[chars[i]] = 1;
string::size_type i = 0; // source
string::size_type j = 0; // destination
while(i < str.size())
{
if(chars_map[str[i]] != 0)
++i;
else
{
str[j] = str[i];
++i;
++j;
}
}
str.resize(j);
}
Question 3: What are the different ways I can improve this function? Or is this the best we can do?
Thanks!

Good job, now learn about the standard library algorithms and boost:
str.erase(std::remove_if(str.begin(), str.end(), boost::is_any_of("lor")), str.end());
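If Boost is not available, the same erase-remove idiom can be written with the standard library alone. This is only a minimal sketch, assuming C++11 lambdas; the name remove_chars_std is made up for this example:

```cpp
#include <algorithm>
#include <string>

// Erase every character of `str` that also occurs in `chars`.
// (remove_chars_std is a hypothetical name, not part of any library.)
void remove_chars_std(std::string& str, const std::string& chars)
{
    // remove_if moves the kept characters to the front and returns the
    // new logical end; erase then trims the leftover tail.
    str.erase(std::remove_if(str.begin(), str.end(),
                             [&chars](char c) {
                                 return chars.find(c) != std::string::npos;
                             }),
              str.end());
}
```

The predicate scans chars for every input character, so this is best suited to short delete lists like "lor".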

Assuming that you're studying algorithms, and not interested in library solutions:
Hash tables are most valuable when the number of possible keys is large, but you only need to store a few of them. Your hash table would make sense if you were deleting specific 32-bit integers from digit sequences. But with ASCII characters, it's overkill.
Just make an array of 256 bools and set a flag for each character you want to delete. It only uses one table lookup per input character, whereas a hash map needs at least a few more instructions to compute the hash function. Space-wise, a hash map is probably no more compact once you add up all its auxiliary data.
void remove_chars(string& str, const string& chars)
{
// set up the look-up table
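std::vector<bool> discard(256, false);
for (size_t i = 0; i < chars.size(); ++i)
{
// cast first: plain char may be signed, and a negative index is out of bounds
discard[static_cast<unsigned char>(chars[i])] = true;
}
for (size_t j = 0; j < str.size(); ++j)
{
if (discard[static_cast<unsigned char>(str[j])])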
{
// do something, depending on your storage choice
}
}
}
Regarding your storage choices: Choose between options 2 and 3 depending on whether you need to preserve the input data or not. 3 is obviously most efficient, but you don't always want an in-place procedure.
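To make the two storage choices concrete, here is one possible way to finish the look-up table version, once in place (option 3) and once as a filtered copy (option 2). It is a sketch rather than the only way to do it; the helper names are invented for this example:

```cpp
#include <array>
#include <string>

// Build a 256-entry table: table[b] is true when byte b must be dropped.
// (make_discard_table is a hypothetical helper.)
static std::array<bool, 256> make_discard_table(const std::string& chars)
{
    std::array<bool, 256> table{};  // value-initialised: all false
    for (char c : chars)
        table[static_cast<unsigned char>(c)] = true;
    return table;
}

// Option 3: compact the string in place, then cut off the unused tail.
void remove_chars_in_place(std::string& str, const std::string& chars)
{
    const std::array<bool, 256> discard = make_discard_table(chars);
    std::string::size_type out = 0;  // next write position, never ahead of the read position
    for (char c : str)
        if (!discard[static_cast<unsigned char>(c)])
            str[out++] = c;
    str.resize(out);
}

// Option 2: leave the input untouched and return a filtered copy.
std::string remove_chars_copy(const std::string& str, const std::string& chars)
{
    const std::array<bool, 256> discard = make_discard_table(chars);
    std::string result;
    result.reserve(str.size());  // upper bound, avoids reallocations
    for (char c : str)
        if (!discard[static_cast<unsigned char>(c)])
            result.push_back(c);
    return result;
}
```

Both functions are O(n + m) for an input of length n and a delete list of length m.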

Here is a KISS solution with several advantages: no lookup table, no allocation, and dest may be the same buffer as src:
void remove_chars (char *dest, const char *src, const char *excludes)
{
do {
if (!strchr (excludes, *src))
*dest++ = *src;
} while (*src++);
*dest = '\000';
}

You can ping pong between strcspn and strspn to avoid the need for a hash table:
void remove_chars(
const char *input,
char *output,
const char *characters)
{
const char *next_input= input;
char *next_output= output;
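while (*next_input!='\0')
{
/* strcspn: length of the leading run that contains none of the characters */
size_t copy_length= strcspn(next_input, characters);
memcpy(next_output, next_input, copy_length);
next_output+= copy_length;
next_input+= copy_length;
/* strspn: length of the following run made only of characters to drop */
next_input+= strspn(next_input, characters);
}
*next_output= '\0';
}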

Map arbitrary set of symbols to consecutive integers

Given a set of byte-representable symbols (e.g. characters, short strings, etc), is there a way to find a mapping from that set to a set of consecutive natural numbers that includes 0? For example, suppose there is the following unordered set of characters (not necessarily in any particular character set).
'a', '(', '🍌'
Is there a way to find a "hash" function of sorts that would map each symbol (e.g. by means of its byte representation) uniquely to one of the integers 0, 1, and 2, in any order? For example, 'a'=0, '('=1, '🍌'=2 is just as valid as 'a'=2, '('=0, '🍌'=1.
Why?
Because I am developing something for a memory-constrained (think on the order of kiB) embedded target that has a lot of fixed reverse-lookup tables, so something like std::unordered_map would be out of the question. The ETL equivalent etl::unordered_map would be getting there, but there's quite a bit of size overhead, and collisions can happen, so lookup timings could differ. A sparse lookup table would work, where the byte representation of the symbol would be the index, but that would be a lot of wasted space, and there are many different tables.
There's also the chance that the "hash" function may end up costing more than the above alternatives, but my curiosity is a strong driving force. Also, although both C and C++ are tagged, this question is specific to neither of them. I just happen to be using C/C++.
The normal way to do things like this, for example when coding a font for a custom display, is to map everything to a sorted, read-only look-up table array with indices 0 to 127 or 0 to 255, where symbols from the old ASCII table are mapped to their respective ASCII index. Other things, like your banana symbol, can be mapped beyond index 127.
So when you use FONT [97] or FONT ['a'], you end up with the symbol corresponding to 'a'. That way you can translate from ASCII strings to your custom table, or from your source editor font to the custom table.
Using any other data type such as a hash table sounds like muddy program design to me. Embedded systems should by their nature be deterministic, so overly complex data structures don't make sense most of the time. If you for some reason unknown must have the data unordered, then you should describe the reason why in detail, or otherwise you are surely asking an "XY question".
Yes, there is such a map. Just put all the symbols in an array of strings, and write a function that searches for the word in the array and returns its index in the array.
static char *strings[] = {
"word1", "word2", "hello", "world", NULL, /* to end the array */
};
int word_number(const char *word)
{
for(int i = 0; strings[i] != NULL; i++) {
if (strcmp(strings[i], word) == 0)
return i;
}
return -1; /* not found */
}
The cost of this in space terms is very low, considering that the compiler can optimize string allocation based on common suffixes (making one literal overlap another if it is a suffix of it). If you give the compiler an already sorted array of literals, you can also use bsearch(), which is O(log n) in the number of elements in the table:
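#include <stdlib.h> /* bsearch */
#include <string.h> /* strcmp */
static const char *strings[] = { /* this time sorted */
"Hello",
"rella", /* this and the next may be merged into positions on the same literal below
* This can be controlled with a compiler option. */
"umbrella",
"world"
};
const int strings_len = 4;
/* bsearch() hands the comparator pointers to the key and to array
 * elements, so both arguments are really const char ** here */
int string_cmp(const void *_s1, const void *_s2)
{
const char *const *s1 = _s1, *const *s2 = _s2;
return strcmp(*s1, *s2);
}
int word_number(const char *word)
{
/* argument order: key, base, element count, element size, comparator */
const char **result = bsearch(&word, strings, strings_len, sizeof *strings, string_cmp);
return result ? (int)(result - strings) : -1;
}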
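The same sorted-table idea also works for single symbols such as the 'a', '(' and banana example from the question, if each symbol is stored as its code point. The following is a rough C++11 sketch; kSymbols and symbol_index are names invented here, and storing symbols as char32_t is an assumption about how they are represented:

```cpp
#include <algorithm>
#include <iterator>

// Hypothetical symbol set, stored as code points in ascending order:
// U+0028 '(', U+0061 'a', U+1F34C (banana).
static const char32_t kSymbols[] = { U'(', U'a', U'\U0001F34C' };

// Returns the position of `symbol` in the table (0, 1, 2, ...),
// or -1 if the symbol is not part of the set.
int symbol_index(char32_t symbol)
{
    const char32_t* first = std::begin(kSymbols);
    const char32_t* last = std::end(kSymbols);
    // Binary search: O(log n) comparisons, no extra memory.
    const char32_t* it = std::lower_bound(first, last, symbol);
    if (it == last || *it != symbol)
        return -1;
    return static_cast<int>(it - first);
}
```

The table costs four bytes per symbol and the lookup time is bounded, which may matter on a small deterministic target.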
If you want a function that gives you a number for any string, and maps each string to its own distinct number, it's even easier. Start with zero. For each byte in the string, multiply your number by 256 (the number of byte values) and add the byte to the result; return the result once you have done this with every char in the string. Because a C string contains no zero bytes, you will get a different number for each possible string. But I think this is not what you want.
super_long_integer char2number(const unsigned char *s)
{
super_long_integer result = 0;
int c;
while ((c = *s++) != 0) {
result *= 256;
result += c;
}
return result;
}
But that integer type must be capable of holding numbers in the range [0, 256^(maximum length of accepted string)), which is a very large number.
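For short symbols the number stays manageable. As an illustration only, here is a version limited to 8 bytes so the result fits in a std::uint64_t; the name pack_bytes and the 8-byte limit are assumptions made for this sketch:

```cpp
#include <cstdint>
#include <string>

// Packs up to 8 bytes into one 64-bit value, first byte most significant.
// Returns false (and leaves `out` untouched) when the string is too long.
bool pack_bytes(const std::string& s, std::uint64_t& out)
{
    if (s.size() > sizeof(std::uint64_t))
        return false;
    std::uint64_t result = 0;
    for (char c : s)
        result = result * 256 + static_cast<unsigned char>(c);
    out = result;
    return true;
}
```

For example, "ab" becomes 97 * 256 + 98 = 24930. The values are unique but far from consecutive, so this alone does not give the compact 0, 1, 2 mapping the question asks for.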

Splitting up a string from end to start into groups of two in C++

I was curious how I could write a program that takes a string, finds its end, and then splits it up from the end toward the start into groups of two.
For instance, the user enters mskkllkkk and the output has to be m sk kl lk kk.
I tried to search the net for the tools I needed, and got familiar with iterators, and tried to use them for this purpose. I did something like this:
#include "iostream"
#include "string"
#include "conio.h"
int main() {
int k=0,i=-1;
std::string str1;
std::string::iterator PlaceCounter;
std::cin >> str1;
PlaceCounter = str1.end();
for (PlaceCounter; PlaceCounter != str1.begin(); --PlaceCounter)
{
++k;
if (k % 2 == 0 && k-1 != 0) {
++i;
str1.insert(str1.end()-k-i,' ');
}
}
std::cout << str1;
_getch();
return 0;
}
At first, it seemed to be working just fine when I entered a couple of arbitrary cases (this kind of thing could be used in calculators to make numbers more readable by grouping every three digits, from the end toward the start). But when I entered jsfksdjfksdjfkdsjfskjdfkjsfn, I got the error message "String iterator not decrementable".
Presumably I need to study many more pages of my C++ book to be able to solve this myself, but for now I'm just being super-curious as a beginner. Why do I get that error message? Thanks in advance.
When you insert() into your string, the iterators to it may get invalidated. In particular, all iterators past the insertion point should be considered invalidated in all cases, and all iterators get invalidated if the std::string needs more memory: the internal buffer is replaced by a bigger one, which invalidates all existing iterators (and references and pointers) to string elements.
The easiest fix to the problem is to make sure that the string doesn't need to allocate more memory by reserve()ing enough space ahead of time. Since you add one space for every two characters, making sure that there is space for str1.size() + str1.size() / 2u characters should be sufficient:
str1.reserve(str1.size() + str1.size() / 2u);
for (auto PlaceCounter = str1.end(); PlaceCounter != str1.begin(); --PlaceCounter) {
// ...
}
Note that your algorithm is rather inefficient: it is O(n^2), because every insert() shifts the tail of the string. The operation can be done with O(n) complexity instead. You'd resize the string to its final size right from the start, filling the tail with some default character, and then copy the content, moving from the end, directly to the appropriate location.
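A sketch of that O(n) idea, using indexes instead of iterators so that nothing is invalidated. The function name group_from_end and its group-size parameter are made up for this example:

```cpp
#include <string>

// Inserts a space between groups of `group` characters, counting from
// the end of the string. Runs in O(n) and modifies `s` in place.
void group_from_end(std::string& s, std::string::size_type group)
{
    if (group == 0 || s.size() <= group)
        return;  // nothing to split
    const std::string::size_type old_size = s.size();
    const std::string::size_type spaces = (old_size - 1) / group;
    s.resize(old_size + spaces, ' ');  // grow once, up front

    std::string::size_type src = old_size;  // one past the next char to read
    std::string::size_type dst = s.size();  // one past the next slot to write
    std::string::size_type in_group = 0;    // chars written in the current group
    while (src > 0) {
        if (in_group == group) {
            s[--dst] = ' ';
            in_group = 0;
        }
        --src;
        --dst;
        s[dst] = s[src];  // dst >= src always holds, so no unread char is overwritten
        ++in_group;
    }
}
```

With the input from the question, group_from_end(str1, 2) turns mskkllkkk into m sk kl lk kk.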
str1.insert(str1.end()-k-i,' ');
This modifies the string the loop is iterating over. Specifically, this inserts something into the string.
With a std::string, much like a std::vector, insertion into a string will (may) invalidate all existing iterators pointing to the string. The first insertion performed by the shown code results in undefined behavior, as soon as the existing, now invalidated, iterators are referenced afterwards.
You will need to either replace your iterators with indexes into the string, or instead of modifying the existing string construct a new string, leaving the original string untouched.
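The second option, building a new string and leaving the original untouched, could look roughly like this. grouped_copy is a hypothetical name and the group size is passed in as a parameter:

```cpp
#include <string>

// Returns a copy of `s` with a space before every full group of
// `group` characters, counted from the end. `s` itself is not modified.
std::string grouped_copy(const std::string& s, std::string::size_type group)
{
    if (group == 0)
        return s;
    std::string out;
    out.reserve(s.size() + s.size() / group);
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        // A group boundary sits wherever the number of remaining
        // characters is a multiple of the group size.
        if (i != 0 && (s.size() - i) % group == 0)
            out.push_back(' ');
        out.push_back(s[i]);
    }
    return out;
}
```

Because only indexes into the read-only input are used, there are no iterators that could be invalidated.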
Here is a possible C++ approach to try. From my tool bag, here is how I insert commas into a decimal string (i.e. s is expected to contain digits):
Input: "123456789"
// insert commas from right (at implied decimal point) to left
std::string digiCommaL(std::string s)
{
// Note: decrementing an unsigned type (such as size_t) wraps around
// instead of going negative, so use a signed int for the index.
int32_t sSize = static_cast<int32_t>(s.size()); // change to int
if (sSize > 3)
for (int32_t indx = (sSize - 3); indx > 0; indx -= 3)
s.insert(static_cast<size_t>(indx), 1, ',');
return(s);
}
Returns: "123,456,789"

Why does my array element retrieval function return random value?

I am trying to write my own simple string implementation in C++. My implementation is not \0 delimited, but uses the first element in my character array (the data structure I have chosen to implement the string) as the length of the string.
In essence, I have this as my data structure: typedef char * arrayString; and I have the following implementation of some basic string manipulation routines:
#include "stdafx.h"
#include <iostream>
#include "new_string.h"
// Our string implementation will store the
// length of the string in the first byte of
// the string.
int getLength(const arrayString &s1) {
return s1[0] - '0';
}
void append_str(arrayString &s, char c) {
int length = getLength(s); // get the length of our current string
length++; // account for the new character
arrayString newString = new char[length]; // create a new heap allocated string
newString[0] = length;
// fill the string with the old contents
for (int counter = 1; counter < length; counter++) {
newString[counter] = s[counter];
}
// append the new character
newString[length - 1] = c;
delete[] s; // prevent a memory leak
s = newString;
}
void display(const arrayString &s1) {
int max = getLength(s1);
for (int counter = 1; counter <= max; counter++) {
std::cout << s1[counter];
}
}
void appendTest() {
arrayString a = new char[5];
a[0] = '5'; a[1] = 'f'; a[2] = 'o'; a[3] = 't'; a[4] = 'i';
append_str(a, 's');
display(a);
}
My issue is with the implementation of my function getLength(). I have tried to debug my program inside Visual Studio, and all seems nice and well in the beginning.
The first time getLength() is called, inside the append_str() function, it returns the correct value for the string length (5). When it gets called inside display(), my own custom string displaying function (to prevent a bug with std::cout), it reads the value (6) correctly, but returns -42. What's going on?
NOTES
Ignore my comments in the code. It's purely educational and it's just me trying to see what level of commenting improves the code and what level reduces its quality.
In getLength(), I had to do first_element - '0' because otherwise the function would return the ASCII code of the digit stored there. For instance, for decimal 6, it returned 54.
This is an educational endeavour, so if you see anything else worth commenting on, or fixing, by all means, let me know.
Since getLength() computes the length as return s1[0] - '0';, you should store the length as newString[0] = length + '0'; instead of newString[0] = length;
As an aside, why are you storing the size of the string in the array? Why not have some sort of integer member that you store the size in? A couple of bytes really isn't going to hurt, and then you have a string that can be more than 255 characters long.
You are accessing your array out of bounds in a couple of places.
In append_str
for (int counter = 1; counter < length; counter++) {
newString[counter] = s[counter];
}
In the example you presented, the starting string is "5foti" -- without the terminating null character. The maximum valid index is 4. In the above function, length has already been set to 6 and you are accessing s[5].
This can be fixed by changing the conditional in the for statement to counter < length-1;
And in display.
int max = getLength(s1);
for (int counter = 1; counter <= max; counter++) {
std::cout << s1[counter];
}
Here again, you are accessing the array out of bounds by using counter <= max in the loop.
This can be fixed by changing the conditional in the for statement to counter < max;
Here are some improvements, that should also cover your question:
1. Instead of a typedef, define a class for your string. The class should have an int for the length and a char* for the string data itself.
2. Use operator overloads in your class "string" so you can append strings with + etc.
3. The - '0' gives me pain. You subtract the ASCII value of '0' (48) when reading the length, but you do not add it when storing the length in append_str. Also, the length can be 127 at maximum, because char typically goes from -128 to +127. See point 1.
4. append_str changes the pointer of your object. That's very bad practice!
Ok, thank you everyone for helping me out.
The problem appeared to be inside the appendTest() function, where I was storing in the first element of the array the character code for the value I wanted to have as a size (i.e storing '5' instead of just 5). It seems that I didn't edit previous code that I had correctly, and that's what caused me the issues.
As an aside to what many of you are asking, why am I not using classes or better design, it's because I want to implement a basic string structure having many constraints, such as no classes, etc. I basically want to use only arrays, and the most I am affording myself is to make them dynamically allocated, i.e resizable.
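For reference, here is one way the array-only design could look with the length stored as a raw number rather than an ASCII digit. This is a sketch under assumptions: the first byte counts only the characters after it, the length is at most 255, and makeString is a helper invented for this example:

```cpp
#include <cstring>

typedef char* arrayString;

// Layout assumed here: s[0] holds the character count as a raw number
// (not an ASCII digit), followed by exactly that many characters.
int getLength(const arrayString& s)
{
    return static_cast<unsigned char>(s[0]);
}

// Hypothetical helper: builds an arrayString from a C string (length <= 255).
arrayString makeString(const char* text)
{
    const std::size_t n = std::strlen(text);
    arrayString s = new char[n + 1];  // length byte + n characters
    s[0] = static_cast<char>(n);
    std::memcpy(s + 1, text, n);
    return s;
}

void append_str(arrayString& s, char c)
{
    const int oldLength = getLength(s);
    arrayString grown = new char[oldLength + 2];  // length byte + old chars + c
    grown[0] = static_cast<char>(oldLength + 1);
    for (int i = 1; i <= oldLength; ++i)          // copy the old characters
        grown[i] = s[i];
    grown[oldLength + 1] = c;                     // append the new character
    delete[] s;                                   // release the old buffer
    s = grown;
}
```

Reading the length through unsigned char keeps values above 127 from turning negative.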

Recursive String Transformations

EDIT: I've made the main change of using iterators to keep track of successive positions in the bit and character strings and pass the latter by const ref. Now, when I copy the sample inputs onto themselves multiple times to test the clock, everything finishes within 10 seconds for really long bit and character strings and even up to 50 lines of sample input. But, still when I submit, CodeEval says the process was aborted after 10 seconds. As I mention, they don't share their input so now that "extensions" of the sample input work, I'm not sure how to proceed. Any thoughts on an additional improvement to increase my recursive performance would be greatly appreciated.
NOTE: Memoization was a good suggestion but I could not figure out how to implement it in this case since I'm not sure how to store the bit-to-char correlation in a static look-up table. The only thing I thought of was to convert the bit values to their corresponding integer but that risks integer overflow for long bit strings and seems like it would take too long to compute. Further suggestions for memoization here would be greatly appreciated as well.
This is actually one of the moderate CodeEval challenges. They don't share the sample input or output for moderate challenges but the output "fail error" simply says "aborted after 10 seconds," so my code is getting hung up somewhere.
The assignment is simple enough. You take a filepath as the single command-line argument. Each line of the file will contain a sequence of 0s and 1s and a sequence of As and Bs, separated by a white space. You are to determine whether the binary sequence can be transformed into the letter sequence according to the following two rules:
1) Each 0 can be converted to any non-empty sequence of As (e.g, 'A', 'AA', 'AAA', etc.)
2) Each 1 can be converted to any non-empty sequence of As OR Bs (e.g., 'A', 'AA', etc., or 'B', 'BB', etc.), but not a mixture of the letters
The constraints are to process up to 50 lines from the file and that the length of the binary sequence is in [1,150] and that of the letter sequence is in [1,1000].
The most obvious starting algorithm is to do this recursively. What I came up with was for each bit, collapse the entire next allowed group of characters first, test the shortened bit and character strings. If it fails, add back one character from the killed character group at a time and call again.
Here is my complete code. I removed cmd-line argument error checking for brevity.
#include <iostream>
#include <fstream>
#include <string>
#include <iterator>
using namespace std;
//typedefs
typedef string::const_iterator str_it;
//declarations
//use const ref and iterators to save time on copying and erasing
bool TransformLine(const string & bits, str_it bits_front, const string & chars, str_it chars_front);
int main(int argc, char* argv[])
{
//check there are at least two command line arguments: binary executable and file name
//ignore additional arguments
if(argc < 2)
{
cout << "Invalid command line argument. No input file name provided." << "\n"
<< "Goodybe...";
return -1;
}
//create input stream and open file
ifstream in;
in.open(argv[1], ios::in);
while(!in.is_open())
{
string name; //a std::string owns its storage; an uninitialized char* has none to read into
cout << "Invalid file name. Please enter file name: ";
cin >> name;
in.open(name.c_str(), ios::in);
}
//variables
string line_bits, line_chars;
//reserve space up to constraints to reduce resizing time later
line_bits.reserve(150);
line_chars.reserve(1000);
int line = 0;
//loop over lines (<=50 by constraint, ignore the rest)
while((in >> line_bits >> line_chars) && (line < 50))
{
line++;
//impose bit and char constraints
if(line_bits.length() > 150 ||
line_chars.length() > 1000)
continue; //skip this line
(TransformLine(line_bits, line_bits.begin(), line_chars, line_chars.begin()) == true) ? (cout << "Yes\n") : (cout << "No\n");
}
//close file
in.close();
return 0;
}
bool TransformLine(const string & bits, str_it bits_front, const string & chars, str_it chars_front)
{
//using iterators so store current length as local const
//can make these const because they're not altered here
int bits_length = distance(bits_front, bits.end());
int chars_length = distance(chars_front, chars.end());
//check success rule
if(bits_length == 0 && chars_length == 0)
return true;
//Check fail rules:
//1. next bit is 0 but next char is B
//2. bits length is zero (but char is not, by previous if)
//3. char length is zero (but bits length is not, by previous if)
if((*bits_front == '0' && *chars_front == 'B') ||
bits_length == 0 ||
chars_length == 0)
return false;
//we now know that chars_length != 0 => chars_front != chars.end()
//kill a bit and then call recursively with each possible reduction of front char group
bits_length = distance(++bits_front, bits.end());
//current char group tracker
const char curr_char_type = *chars_front; //use const so compiler can optimize
int curr_pos = distance(chars.begin(), chars_front); //position of current front in char string
//since chars are 0-indexed, the following is also length of current char group
//start searching from curr_pos and length is relative to curr_pos so subtract it!!!
int curr_group_length = chars.find_first_not_of(curr_char_type, curr_pos)-curr_pos;
//make sure this isn't the last group!
if(curr_group_length < 0 || curr_group_length > chars_length)
curr_group_length = chars_length; //distance to end is precisely distance(chars_front, chars.end()) = chars_length
//kill the curr_char_group
//if curr_group_length = char_length then this will make chars_front = chars.end()
//and this will mean that chars_length will be 0 on next recursive call.
chars_front += curr_group_length;
curr_pos = distance(chars.begin(), chars_front);
//call recursively, adding back a char from the current group until 1 less than starting point
int added_back = 0;
while(added_back < curr_group_length)
{
if(TransformLine(bits, bits_front, chars, chars_front))
return true;
//insert back one char from the current group
else
{
added_back++;
chars_front--; //represents adding back one character from the group
}
}
//if here then all recursive checks failed so initial must fail
return false;
}
They give the following test cases, which my code solves correctly:
Sample input:
1010 AAAAABBBBAAAA
00 AAAAAA
01001110 AAAABAAABBBBBBAAAAAAA
1100110 BBAABABBA
Correct output:
Yes
Yes
Yes
No
Since a transformation is possible if and only if copies of it are, I tried just copying each binary and letter sequences onto itself various times and seeing how the clock goes. Even for very long bit and character strings and many lines it has finished in under 10 seconds.
My question is: since CodeEval is still saying it is running longer than 10 seconds but they don't share their input, does anyone have any further suggestions to improve the performance of this recursion? Or maybe a totally different approach?
Thank you in advance for your help!
Here's what I found:
Pass by constant reference
Strings and other large data structures should be passed by constant reference.
This allows the compiler to pass a pointer to the original object, rather than making a copy of the data structure.
Call functions once, save result
You are calling bits.length() twice. You should call it once and save the result in a constant variable. This allows you to check the status again without calling the function.
Function calls are expensive for time critical programs.
Use constant variables
If you are not going to modify a variable after assignment, use the const in the declaration:
const char curr_char_type = chars[0];
The const allows compilers to perform higher order optimization and provides safety checks.
Change data structures
Since you are performing inserts, possibly in the middle of a string, you should use a different data structure for the characters. The std::string data type may need to reallocate after an insertion AND move the letters further down. Insertion is faster with a std::list<char> because a linked list only swaps pointers. There may be a trade-off, because a linked list needs to dynamically allocate memory for each character.
Reserve space in your strings
When you create the destination strings, you should use a constructor that preallocates or reserves room for the largest size string. This will prevent the std::string from reallocating. Reallocations are expensive.
Don't erase
Do you really need to erase characters in the string?
By using starting and ending indices, you can overwrite existing letters without having to erase the entire string.
Partial erasures are expensive. Complete erasures are not.
For more assistance, post to Code Review at StackExchange.
This is a classic recursion problem. However, a naive implementation of the recursion would lead to an exponential number of re-evaluations of previously computed function values. Using a simpler example for illustration, compare the runtime of the following two functions for a reasonably large N. Let's not worry about the int overflowing.
int RecursiveFib(int N)
{
if(N<=1)
return 1;
return RecursiveFib(N-1) + RecursiveFib(N-2);
}
int IterativeFib(int N)
{
if(N<=1)
return 1;
int a_0 = 1, a_1 = 1;
for(int i=2;i<=N;i++)
{
int temp = a_1;
a_1 += a_0;
a_0 = temp;
}
return a_1;
}
You would need to follow a similar approach here. There are two common ways of approaching the problem: dynamic programming and memoization. Memoization is the easiest way of modifying your approach. Below is a memoized Fibonacci implementation to illustrate how your implementation can be sped up.
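int MemoFib(int N)
{
if(N<=1)
return 1;
// the table persists across calls and grows on demand; it needs N+1 slots for index N
static vector<int> memo;
if(memo.size() <= static_cast<size_t>(N))
memo.resize(N+1, -1);
if(memo[N]!=-1)
return memo[N];
const int res = MemoFib(N-1) + MemoFib(N-2);
memo[N] = res;
return res;
}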
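Applied to the transformation problem, the sub-problem can be identified by a pair of positions (index into the bit string, index into the letter string), so a table of at most 151 x 1001 entries replaces the look-up table keyed by bit values that the question worried about. The following bottom-up version is only a sketch of that idea, not the asker's code; can_transform is a made-up name and the input is assumed to contain only 0/1 and A/B:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// ok[i][j] is true when bits[i..] can be transformed into chars[j..].
bool can_transform(const std::string& bits, const std::string& chars)
{
    const std::size_t n = bits.size(), m = chars.size();
    std::vector<std::vector<char> > ok(n + 1, std::vector<char>(m + 1, 0));
    ok[n][m] = 1;  // nothing left on either side: success

    for (std::size_t i = n; i-- > 0;) {
        for (std::size_t j = m; j-- > 0;) {
            const char c = chars[j];
            if (bits[i] == '0' && c == 'B')
                continue;  // a 0 can only produce As
            // Let bit i consume chars[j..k-1], a run of the single letter c.
            for (std::size_t k = j + 1; k <= m; ++k) {
                if (ok[i + 1][k]) { ok[i][j] = 1; break; }
                if (k == m || chars[k] != c)
                    break;  // the run of identical letters ends here
            }
        }
    }
    return ok[0][0] != 0;
}
```

Each table entry is computed once, so the work is bounded by bits x letters x longest run instead of growing exponentially with the input length.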
Your failure message is "Aborted after 10 seconds" -- implying that the program was working fine as far as it went, but it took too long. This is understandable, given that your recursive program takes exponentially more time for longer input strings -- it works fine for the short (2-8 digit) strings, but will take a huge amount of time for 100+ digit strings (which the test allows for). To see how your running time goes up, you should construct yourself some longer test inputs and see how long they take to run. Try things like
0000000011111111 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBAAAAAAAA
00000000111111110000000011111111 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBAAAAAAAA
and longer. You need to be able to handle up to 150 digits and 1000 letters.
At CodeEval, you can submit a "solution" that just outputs what the input is, and do that to gather their test set. They may have variations so you may wish to submit it a few times to gather more samples. Some of them are too difficult to solve manually though... the ones you can solve manually will also run very quickly at CodeEval too, even with inefficient solutions, so there's that to consider.
Anyway, I did this same problem at CodeEval (using VB of all things), and my solution recursively looked for the "next index" of both A and B depending on what the "current" index is for where I was in a translation (after checking stoppage conditions first thing in the recursive method). I did not use memoization but that might've helped speed it up even more.
PS, I have not run your code, but it does seem curious that the recursive method contains a while loop within which the recursive method is called... since it's already recursive and should therefore encompass every scenario, is that while() loop necessary?

Which data structure and algorithm is appropriate for this?

I have thousands of strings. Given a pattern, I need to search all the strings and return every string that contains the pattern.
Presently I am using a vector to store the original strings. I search each one for the pattern, add it to a new vector if it matches, and finally return that vector.
int main() {
vector <string> v;
v.push_back ("maggi");
v.push_back ("Active Baby Pants Large 9-14 Kg ");
v.push_back ("Premium Kachi Ghani Pure Mustard Oil ");
v.push_back ("maggi soup");
v.push_back ("maggi sauce");
v.push_back ("Superlite Advanced Jar");
v.push_back ("Superlite Advanced");
v.push_back ("Goldlite Advanced");
v.push_back ("Active Losorb Oil Jar");
vector <string> result;
string str = "Advanced";
for (unsigned i=0; i<v.size(); ++i)
{
size_t found = v[i].find(str);
if (found!=string::npos)
result.push_back(v[i]);
}
for (unsigned j=0; j<result.size(); ++j)
{
cout << result[j] << endl;
}
// your code goes here
return 0;
}
Is there any optimum way to achieve the same with lesser complexity and higher performance ??
I think the containers you are using are appropriate for your application.
However instead of std::string::find, if you implement your own KMP algorithm, then you can guarantee the time complexity to be linear in terms of the length of string + search string.
http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
The complexity of std::string::find itself is unspecified:
http://www.cplusplus.com/reference/string/string/find/
EDIT: As pointed out in the linked question "C++ string::find complexity", if your strings are not large (large meaning more than about 1000 characters), then std::string::find is probably good enough, since it needs no table-building step.
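For completeness, a compact sketch of a KMP search with the same return convention as std::string::find (index of the first match, or npos). The names build_failure and kmp_find are invented for this example:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// fail[i] = length of the longest proper prefix of p[0..i]
// that is also a suffix of p[0..i].
std::vector<std::size_t> build_failure(const std::string& p)
{
    std::vector<std::size_t> fail(p.size(), 0);
    for (std::size_t i = 1, k = 0; i < p.size(); ++i) {
        while (k > 0 && p[i] != p[k])
            k = fail[k - 1];  // fall back to a shorter border
        if (p[i] == p[k])
            ++k;
        fail[i] = k;
    }
    return fail;
}

// Returns the index of the first occurrence of p in text, or npos.
// Runs in O(text.size() + p.size()).
std::size_t kmp_find(const std::string& text, const std::string& p)
{
    if (p.empty())
        return 0;
    const std::vector<std::size_t> fail = build_failure(p);
    for (std::size_t i = 0, k = 0; i < text.size(); ++i) {
        while (k > 0 && text[i] != p[k])
            k = fail[k - 1];  // reuse what is known about the matched prefix
        if (text[i] == p[k])
            ++k;
        if (k == p.size())
            return i + 1 - k;  // start of the match
    }
    return std::string::npos;
}
```

When the same pattern is searched in thousands of strings, the failure table could be built once and reused; this sketch rebuilds it on every call to stay short.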
If the result is used in the same block of code as the input string vector (as in your example), or if you otherwise have a guarantee that the result is only used while the input exists, you don't actually need to copy the strings. Copying can be an expensive operation that considerably slows down the whole algorithm.
Instead you could have a vector of pointers as the result:
vector <string*> result;
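A sketch of the search loop with pointer results. The function name find_matches is made up, and const pointers are used here on the assumption that the caller only reads the matches:

```cpp
#include <string>
#include <vector>

// Collects pointers to the matching strings instead of copying them.
// The pointers stay valid only while `items` is alive and not resized.
std::vector<const std::string*> find_matches(
    const std::vector<std::string>& items, const std::string& pattern)
{
    std::vector<const std::string*> result;
    for (const std::string& item : items)
        if (item.find(pattern) != std::string::npos)
            result.push_back(&item);
    return result;
}
```

Each result entry costs one pointer, regardless of how long the matched string is.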
If the list of strings is "fixed" for many searches then you can do some simple preprocessing to speed up things quite considerably by using an inverted index.
Build a map of all chars present in the strings, in other words for each possible char store a list of all strings containing that char:
std::map< char, std::vector<int> > index;
std::vector<std::string> strings;
void add_string(const std::string& s) {
int new_pos = strings.size();
strings.push_back(s);
for (int i=0,n=s.size(); i<n; i++) {
index[s[i]].push_back(new_pos);
}
}
Then when asked to search for a substring you first check for all chars in the inverted index and iterate only on the list in the index with the smallest number of entries:
std::vector<std::string *> matching(const std::string& text) {
std::vector<int> *best_ix = NULL;
for (int i=0,n=text.size(); i<n; i++) {
std::vector<int> *ix = &index[text[i]];
if (best_ix == NULL || best_ix->size() > ix->size()) {
best_ix = ix;
}
}
std::vector<std::string *> result;
if (best_ix) {
for (int i=0,n=best_ix->size(); i<n; i++) {
std::string& cand = strings[(*best_ix)[i]];
if (cand.find(text) != std::string::npos) {
result.push_back(&cand);
}
}
} else {
// Empty text as input, just return the whole list
for (int i=0,n=strings.size(); i<n; i++) {
result.push_back(&strings[i]);
}
}
return result;
}
Many improvements are possible:
use a bigger index (e.g. using pairs of consecutive chars)
avoid considering very common chars (stop lists)
use hashes computed from triplets or longer sequences
search the intersection of the lists instead of only the shortest list. Since elements are added in order, the vectors are already sorted, and the intersection can be computed efficiently even with vectors (see std::set_intersection).
All of them may make sense or not depending on the parameters of the problem (how many strings, how long, how long is the text being searched ...).
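As an illustration of the intersection idea, two sorted index lists can be combined like this. intersect_postings is a hypothetical helper, and the inputs are assumed to be sorted in ascending order, which holds for lists built by appending increasing positions:

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Returns the string ids that appear in both sorted lists.
std::vector<int> intersect_postings(const std::vector<int>& a,
                                    const std::vector<int>& b)
{
    std::vector<int> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}
```

Note that add_string above pushes an id once per occurrence of a char, so the lists can contain repeated ids. std::set_intersection treats its inputs as multisets, so the output may repeat ids as well; skipping the push when the last stored id equals new_pos would avoid that.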
If the source text is large and static (e.g. crawled webpages), then you can save search time by pre-building a suffix tree or a trie data structure. The search pattern can then be matched by traversing the tree.
If the source text is small and changes frequently, then your original approach is appropriate. The STL functions are generally very well optimized and have stood the test of time.