Remove repeating characters from string

Remove repeating characters from string - c++

I have a string, e.g. acaddef or bbaaddgg. I have to remove all repeating characters from it as fast as possible. So, for example, pooaatat should become poat and ggaatpop should become gatpo. Is there any built-in function or algorithm to do that quickly? I searched the STL, but without a satisfying result.

Okay, so here are 4 different solutions.
Fixed Array
std::string str = "pooaatat";
// Prints "poat"
short count[256] = {0};
std::copy_if(str.begin(), str.end(), std::ostream_iterator<char>(std::cout),
[&](unsigned char c) { return count[c]++ == 0; });
Count Algorithm + Iterator
std::string str = "pooaatat";
// Prints "poat"
std::string::iterator iter = str.begin();
std::copy_if(str.begin(), str.end(), std::ostream_iterator<char>(std::cout),
[&](char c) { return !std::count(str.begin(), iter++, c); });
Unordered Set
std::string str = "pooaatat";
// Prints "poat"
std::unordered_set<char> container;
std::copy_if(str.begin(), str.end(), std::ostream_iterator<char>(std::cout),
[&](char c) { return container.insert(c).second; });
Unordered Map
std::string str = "pooaatat";
// Prints "poat"
std::unordered_map<char, int> container;
std::copy_if(str.begin(), str.end(), std::ostream_iterator<char>(std::cout),
[&](char c) { return container[c]++ == 0; });

AFAIK, there is no built-in algorithm for doing this. The std::unique algorithm is valid if you want to remove only consecutive duplicate characters.
However, you can use the following simple approach:
If the string contains only ASCII characters, you can keep a boolean array A[256] that records whether each character has already been encountered.
Then simply traverse the input string and copy each character to the output if A[character] is still 0 (and set A[character] = 1).
If the string contains arbitrary characters, you can use a std::unordered_map or a std::map of char to int instead.
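The array approach described above can be sketched roughly as follows. The function name remove_repeats is hypothetical (it does not come from any of the answers), and the sketch assumes 8-bit characters, so that 256 flags cover every possible value:

```cpp
#include <array>
#include <string>

// Hypothetical helper (not from the original answers): keeps the first
// occurrence of every character and drops all later repeats.
// Assumes 8-bit chars, so 256 flags cover every possible value.
std::string remove_repeats(const std::string& input)
{
    std::array<bool, 256> seen{};   // one flag per unsigned char value, all false
    std::string output;
    output.reserve(input.size());

    for (unsigned char c : input)   // unsigned char keeps the index in 0..255
    {
        if (!seen[c])
        {
            seen[c] = true;
            output.push_back(static_cast<char>(c));
        }
    }
    return output;
}
```

Called as remove_repeats("pooaatat"), this should give poat. It builds a new string rather than modifying the input, which keeps the loop simple at the cost of one allocation.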

Built-in regular expressions should be efficient, e.g.:
#include <regex>
[...]
const std::regex pattern("([\\w ])(?!\\1)");
std::string s = "ssha3akjssss42jj 234444 203488842882387 heeelloooo";
std::string result;
for (std::sregex_iterator i(s.begin(), s.end(), pattern), end; i != end; ++i)
result.append((*i)[1]);
std::cout << result << std::endl;
Of course, you can modify the capturing group to your needs.
The good thing is that it is supported in Visual Studio 2010 tr1 already. gcc 4.8, however, seems to have a problem with regex iterators.
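Note that the pattern above keeps a character only when the next character differs, so it collapses runs of consecutive repeats (much like std::unique) and does not remove repeats that are separated by other characters. A possible alternative for the same effect is std::regex_replace with a back-reference. This is only a sketch; the helper name collapse_runs is hypothetical:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: collapses each run of identical consecutive
// characters into a single character. Repeats separated by other
// characters are left alone.
std::string collapse_runs(const std::string& input)
{
    // (.) captures one character; \1+ matches one or more further copies of it.
    static const std::regex run("(.)\\1+");
    return std::regex_replace(input, run, "$1");
}
```

For example, collapse_runs("heeelloooo") should give helo, while pooaatat becomes poatat because the second a and t are not adjacent to the first ones.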

Related

How to erase non-alpha chars and lowercase the alpha chars in a single pass of a string?

Given a string:
std::string str{"This i_s A stRIng"};
Is it possible to transform it to lowercase and remove all non-alpha characters in a single pass?
Expected result:
this is a string
I know you can use std::transform(..., ::tolower) and string.erase(remove_if()) combination to make two passes, or it can be done manually by iterating each character, but is there a way to do something that would combine the std::transform and erase calls without having to run through the string multiple times?

C++20 ranges allow algorithms to be combined, in a one-pass fashion.
std::string str{"This i_s A stRIng"};
std::string out;
auto is_alpha_or_space = [](unsigned char c){ return isalpha(c) || isspace(c); };
auto safe_tolower = [](unsigned char c){ return tolower(c); };
std::ranges::copy( str
| std::views::filter(is_alpha_or_space)
| std::views::transform(safe_tolower)
, std::back_inserter(out));
See it on Compiler Explorer

First, note that you seem to want to keep alphabetic characters and spaces, that is, characters l for which std::isalpha(l) || std::isspace(l) returns true.
Assuming this, you can achieve what you want using std::accumulate:
str = std::accumulate(str.begin(), str.end(), std::string{},
[](const std::string& s, const auto& l) {
if (std::isalpha(l, std::locale()) || std::isspace(l, std::locale()))
return s + std::tolower(l, std::locale());
else
return s;
});
See it Live on Coliru.

I know you can ... do it manually by iterating each character ...
That is exactly what you would have to do, e.g.:
std::string str{"This i_s A stRIng"};
std::string::size_type pos = 0;
while (pos < str.size())
{
unsigned char ch = str[pos];
if (!::isalpha(ch))
{
if (!::isspace(ch))
{
str.erase(pos, 1);
continue;
}
}
else
{
str[pos] = (char) ::tolower(ch);
}
++pos;
}
Or:
std::string str{"This i_s A stRIng"};
auto iter = str.begin();
while (iter != str.end())
{
unsigned char ch = *iter;
if (!::isalpha(ch))
{
if (!::isspace(ch))
{
iter = str.erase(iter);
continue;
}
}
else
{
*iter = (char) ::tolower(ch);
}
++iter;
}
but is there a way to do something that would combine the std::transform and erase calls without having to run through the string multiple times?
You can use the standard std::accumulate() algorithm, as shown in francesco's answer. Although, that will not manipulate the std::string in-place, as the code above does. It will create a new std::string instead (and will do so on each iteration, for that matter).
Otherwise, you could use C++20 ranges, ie by combining std::views::filter() with std::views::transform(), eg (I'm not familiar with the <ranges> library, so this syntax might need tweaking):
#include <ranges>
auto alphaOrSpace = [](unsigned char ch){ return ::isalpha(ch) || ::isspace(ch); };
auto lowercase = [](unsigned char ch){ return ::tolower(ch); };
std::string str{"This i_s A stRIng"};
str = str | std::views::filter(alphaOrSpace) | std::views::transform(lowercase);
But, this would actually be a multi-pass solution, just coded into a single operation.
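For the manual in-place route, each erase() call in the loops above shifts the rest of the string, which can become quadratic on long inputs. A possible alternative is to keep separate read and write indices and shrink the string once at the end. The following is a minimal sketch of that idea; the function name lower_alpha_in_place is hypothetical:

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: lowercases letters, keeps whitespace and drops
// everything else, reading each character exactly once.
void lower_alpha_in_place(std::string& str)
{
    std::string::size_type write = 0;   // next slot to fill
    for (std::string::size_type read = 0; read < str.size(); ++read)
    {
        const unsigned char ch = static_cast<unsigned char>(str[read]);
        if (std::isalpha(ch))
            str[write++] = static_cast<char>(std::tolower(ch));
        else if (std::isspace(ch))
            str[write++] = static_cast<char>(ch);
        // any other character is skipped
    }
    str.resize(write);                  // cut off the leftover tail
}
```

The write index never overtakes the read index, so no character is overwritten before it has been examined.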

Starting loop at specific index of a std::string?

I wrote the following function:
std::regex r("");
for (std::sregex_iterator i = words_begin; i != words_end; ++i) {}
It starts looking for regex matches from the beginning of the given string (str). But how can I tell it to exclude everything before a specific index?
For example, I want it to deal only with what comes after index 4 (not including it).
Note: I am calling this code from another function so I tried something like str + 4 in the string parameter but I got an error that it's not l-value.

If I understand your question correctly you can pass a parameter to the function with the position where you'd like to start the search, and use it to set the iterator:
void func_str(const std::string& str, int pos)
{
std::regex r("\\{[^}]*\\}");
auto words_begin =
std::sregex_iterator(str.begin() + pos, str.end(), r);
//...
}
int main()
{
std::string str = "somestring";
func_str(str, 4);
}
Or pass the iterators themselves, one to the position you'd like to start the search and one to the end of the string:
void func_str(std::string::iterator it_begin, std::string::iterator it_end)
{
std::regex r("\\{[^}]*\\}");
auto words_begin =
std::sregex_iterator(it_begin, it_end, r);
//...
}
int main()
{
std::string str = "somestring";
func_str(str.begin() + 4, str.end());
}
As #bruno correctly stated, you may pass str.substr(4) (not str + 4) as the argument instead of the original string. The downside of that method, as #Marek also correctly pointed out, is that it creates an unnecessary copy of the string to be searched, so passing a position or begin and end iterators is less expensive. The upside is that you would not have to change anything in the function.
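One detail to keep in mind with the offset approach: positions reported by the match (m.position()) are measured from the iterator the search started at, not from the beginning of the string. The sketch below sidesteps that by measuring the distance from str.begin() directly. The helper name match_positions_from and the bounds check are assumptions added for illustration:

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the index (relative to the whole string)
// of every {...} match that starts at or after `pos`.
std::vector<std::size_t> match_positions_from(const std::string& str, std::size_t pos)
{
    std::vector<std::size_t> result;
    if (pos > str.size())
        return result;                       // nothing to search

    static const std::regex r("\\{[^}]*\\}");
    const auto first = str.begin() + static_cast<std::ptrdiff_t>(pos);
    for (std::sregex_iterator it(first, str.end(), r), end; it != end; ++it)
    {
        // Distance from the real start of the string to the start of the match.
        result.push_back(static_cast<std::size_t>(
            std::distance(str.begin(), (*it)[0].first)));
    }
    return result;
}
```

With the string "{1}, {2} and {3}" and pos 4, this should report the matches starting at indices 5 and 13.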

I suggest checking std::smatch::position() to determine if the match is to be taken or discarded:
#include <iostream>
#include<regex>
int main() {
std::regex r("\\{[^}]*\\}");
std::string str("{1}, {2} and {3}");
auto words_begin =
std::sregex_iterator(str.begin(), str.end(), r);
auto words_end = std::sregex_iterator();
for (std::sregex_iterator i = words_begin; i != words_end; ++i) {
std::smatch m = *i;
if (m.position() > 4) {
std::cout << m.str() << std::endl;
}
}
return 0;
}
See the C++ demo online. Adjust the if condition as you need.
Here, the first {1} match is discarded since its position is less than or equal to 4.

How to get a word vector from a string?

I want to store words separated by spaces into single string elements in a vector.
The input is a string that may or may not end in a symbol (comma, period, etc.).
All symbols will be separated by spaces too.
I created this function but it doesn't return me a vector of words.
vector<string> single_words(string sentence)
{
vector<string> word_vector;
string result_word;
for (size_t character = 0; character < sentence.size(); ++character)
{
if (sentence[character] == ' ' && result_word.size() != 0)
{
word_vector.push_back(result_word);
result_word = "";
}
else
result_word += character;
}
return word_vector;
}
What did I do wrong?

Your problem has already been resolved by the other answers and comments.
I would like to add that such functionality already exists in C++.
You could take advantage of the fact that the extraction operator extracts space-separated tokens from a stream. Because a std::string is not a stream, we first put the string into a std::istringstream and then extract from this stream via the std::istream_iterator.
We could make life even easier.
For roughly 10 years we have had dedicated C++ functionality for splitting strings into tokens, explicitly designed for this purpose: the std::sregex_token_iterator. And because we have such a dedicated function, we should simply use it.
The idea behind it is the iterator concept. In C++ we have many containers, and always iterators to iterate over the similar elements in these containers. A string with similar elements (tokens) separated by a delimiter can also be seen as such a container. With the std::sregex_token_iterator, we can iterate over the elements/tokens/substrings of the string, effectively splitting it up.
This iterator is very powerful and you can do much fancier things with it, but that is too much for here. The important point is that splitting a string into tokens is a one-liner, for example a variable definition that uses a range constructor to iterate over the tokens.
See some examples below:
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include <iterator>
#include <algorithm>
#include <regex>
const std::regex delimiter{ " " };
const std::regex reWord{ "(\\w+)" };
int main() {
// Some debug print function
auto print = [](const std::vector<std::string>& sv) -> void {
std::copy(sv.begin(), sv.end(), std::ostream_iterator<std::string>(std::cout, "\n")); std::cout << "\n"; };
// The test string
std::string test{ "word1 word2 word3 word4." };
//-----------------------------------------------------------------------------------------
// Solution 1: use istringstream and then extract from there
std::istringstream iss1(test);
// Define a vector (CTAD), use its range constructor and, the std::istream_iterator as iterator
std::vector words1(std::istream_iterator<std::string>(iss1), {});
print(words1); // Show debug output
//-----------------------------------------------------------------------------------------
// Solution 2: directly use dedicated function sregex_token iterator
std::vector<std::string> words2(std::sregex_token_iterator(test.begin(), test.end(), delimiter, -1), {});
print(words2); // Show debug output
//-----------------------------------------------------------------------------------------
// Solution 3: directly use dedicated function sregex_token iterator and look for words only
std::vector<std::string> words3(std::sregex_token_iterator(test.begin(), test.end(), reWord, 1), {});
print(words3); // Show debug output
//-----------------------------------------------------------------------------------------
// Solution 4: Use such iterator in an algorithm, to copy data to a vector
std::vector<std::string> words4{};
std::copy(std::sregex_token_iterator(test.begin(), test.end(), reWord, 1), {}, std::back_inserter(words4));
print(words4); // Show debug output
//-----------------------------------------------------------------------------------------
// Solution 5: Use such iterator in an algorithm for direct output
std::copy(std::sregex_token_iterator(test.begin(), test.end(), reWord, 1), {}, std::ostream_iterator<std::string>(std::cout,"\n"));
return 0;
}

You added the index instead of the character:
vector<string> single_words(string sentence)
{
vector<string> word_vector;
string result_word;
for (size_t i = 0; i < sentence.size(); ++i)
{
char character = sentence[i];
if (character == ' ' && result_word.size() != 0)
{
word_vector.push_back(result_word);
result_word = "";
}
else
result_word += character;
}
return word_vector;
}

Your mistake came from naming the loop variable character even though it is actually an index, not a character. I would suggest using a range-based for loop here, since it avoids this kind of confusion. The clean solution is obviously to do what #ArminMontigny said, but I assume you are not allowed to use stringstreams. The code would look like this:
#include <iostream>
#include <string>
#include <vector>
using namespace std;
vector<string> single_words(string sentence)
{
vector<string> word_vector;
string result_word;
for (char& character: sentence) // Now `character` is actually a character.
{
if (character==' ' && result_word.size() != 0)
{
word_vector.push_back(result_word);
result_word = "";
}
else
result_word += character;
}
word_vector.push_back(result_word); // In your solution, you forgot to push the last word into the vector.
return word_vector;
}
int main() {
string sentence="Maybe try range based loops";
vector<string> result= single_words(sentence);
for(string& word: result)
cout<<word<<" ";
return 0;
}
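The loops above still treat every space as the end of a word only when a word is already in progress, so leading or repeated spaces end up inside the next word, and a trailing space can produce an empty last element. A possible variant that handles those cases is sketched below; the name split_words is hypothetical and the sketch splits on the space character only:

```cpp
#include <string>
#include <vector>

// Hypothetical variant of single_words: splits on spaces, ignores
// leading, trailing and repeated spaces, and keeps the final word.
std::vector<std::string> split_words(const std::string& sentence)
{
    std::vector<std::string> words;
    std::string current;
    for (char c : sentence)
    {
        if (c != ' ')
        {
            current += c;                 // still inside a word
        }
        else if (!current.empty())
        {
            words.push_back(current);     // a space ends the current word
            current.clear();
        }
    }
    if (!current.empty())
        words.push_back(current);         // last word has no trailing space
    return words;
}
```

A symbol that is separated by spaces, such as a trailing period, simply becomes its own element.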

Using range-v3 to read comma separated list of numbers

I'd like to use Ranges (I use the range-v3 implementation) to read an input stream that is a comma-separated list of numbers. That is trivial to do without ranges but...
This is what I thought was the straight-forward way to solve it:
auto input = std::istringstream("42,314,11,0,14,-5,37");
auto ints = ranges::istream_view<int>(input) | ranges::view::split(",");
for (int i : ints)
{
std::cout << i << std::endl;
}
But this fails to compile. I've tried a number of variations of this but nothing seems to work; I guess this is wrong in several ways. Can someone please enlighten me as to what I am doing wrong and explain how this should be done instead?
Thanks in advance!

What
ranges::istream_view<int>(input)
does is produce a range that is the rough equivalent of this coroutine (even if you don't understand C++20 coroutines, hopefully this example is simple enough that it gets the point across):
generator<int> istream_view_ints(istream& input) {
int i;
while (input >> i) { // while we can still stream int's out
co_yield i; // ... yield the next int
}
}
Two important points here:
This is a range of ints, so you cannot split it on a string.
This uses the normal stream >>, which does not allow you to provide your own delimiter - it only stops at whitespace.
Altogether, istream_view<int>(input) gives you a range of ints that, on your input, consists of a single int: just 42. The next input would try to read in the , and fail.
In order to get a delimited input, you can use getlines. That will give you a range of string with the delimiter you provide. It uses std::getline internally. Effectively, it's this coroutine:
generator<string> getlines(istream& input, char delim = '\n') {
string s;
while (std::getline(input, s, delim)) {
co_yield s;
}
}
And then you need to convert those strings to ints. Something like this should do the trick:
auto ints = ranges::getlines(input, ',')
| ranges::view::transform([](std::string const& s){ return std::stoi(s); });
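For comparison, the same getline-then-convert idea can be sketched with the standard library only, without range-v3. The helper name parse_delimited_ints is hypothetical, and the sketch makes no attempt to recover from malformed fields:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits `text` on `delim` with std::getline and
// converts every field with std::stoi.
std::vector<int> parse_delimited_ints(const std::string& text, char delim = ',')
{
    std::istringstream input(text);
    std::vector<int> values;
    std::string field;
    while (std::getline(input, field, delim))   // one field per delimiter
    {
        // std::stoi throws std::invalid_argument if the field is not a number.
        values.push_back(std::stoi(field));
    }
    return values;
}
```

Called with "42,314,11,0,14,-5,37", this should return the seven integers in order, including the negative one.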

std::string input = "42,314,11,0,14,-5,37";
auto split_view = ranges::view::split(input, ",");
would produce a range of ranges:
{{'4', '2'}, {'3', '1', '4'}, {'1', '1'}, {'0'}, {'1', '4'}, {'-', '5'}, {'3', '7'}}.
so you might do:
std::string input = "42,314,11,0,14,-5,37";
auto split_view = ranges::view::split(input, ",");
for (auto chars : split_view) {
for (auto c : chars) {
std::cout << c;
}
std::cout << std::endl;
}

how to remove substring from string c++

I have a string s="home/dir/folder/name"
I want to split s in s1="home/dir/folder" and s2="name";
I did:
char *token = strtok( const_cast<char*>(s.c_str() ), "/" );
std::string name;
std::vector<int> values;
while ( token != NULL )
{
name=token;
token = strtok( NULL, "/" );
}
now s1=name. What about s2?

I'd recommend against using strtok. Take a look at Boost Tokenizer instead (here are some examples).
Alternatively, to simply find the position of the last '/', you could use std::string::rfind:
#include <string>
#include <iostream>
int main() {
std::string s = "home/dir/folder/name";
std::string::size_type p = s.rfind('/');
if (p == std::string::npos) {
std::cerr << "s contains no forward slashes" << std::endl;
} else {
std::string s1(s, 0, p);
std::string s2(s, p + 1);
std::cout << "s1=[" << s1 << "]" << std::endl;
std::cout << "s2=[" << s2 << "]" << std::endl;
}
return 0;
}

If your goal is only to get the position of the last \ or / in your string, you might use string::find_last_of which does exactly that.
From there, you can use string::substr or the constructor for std::string that takes iterators to get the sub-part you want.
Just make sure the original string contains at least a \ or /, or that you handle the case properly.
Here is a function that does what you need and returns a pair containing the two parts of the path. If the specified path does not contain any \ or / characters, the whole path is returned as a second member of the pair and the first member is empty. If the path ends with a / or \, the second member is empty.
using std::pair;
using std::string;
pair<string, string> extract_filename(const std::string& path)
{
const string::size_type p = path.find_last_of("/\\");
// No separator: a string like "filename" is assumed.
if (p == string::npos)
return pair<string, string>("", path);
// Ends with a separator: a string like "my/path/" is assumed.
if (p == path.size() - 1)
return pair<string, string>(path.substr(0, p), "");
// A string like "my/path/filename" is assumed.
return pair<string, string>(path.substr(0, p), path.substr(p + 1));
}
Of course you might as well modify this function to throw an error instead of gracefully exiting when the path does not have the expected format.

Several points: first, your use of strtok is undefined behavior; in
the case of g++, it could even lead to some very strange behavior. You
cannot modify the contents of an std::string behind the string's back
and expect to get away with it. (The necessity of a const_cast should
have tipped you off.)
Secondly, if you're going to be manipulating filenames, I'd strongly
recommend boost::filesystem. It knows all about things like path
separators and the like, and the fact that the last component of a path
is generally special (since it may be a filename, and not a directory).
Thirdly, if this is just a one-off, or for some other reason you can't or
don't want to use Boost:
std::string::const_iterator pivot
= std::find( s.rbegin(), s.rend(), '/' ).base();
will give you an iterator to the first character after the last '/', or
to the first character in the string if there isn't one. After that,
it's simple to use the two-iterator constructor of string to get the
two components:
std::string basename( pivot, s.end() );
std::string dirname( s.begin(), pivot == s.begin() ? pivot : pivot - 1 );
And if you later have to support Windows, just replace the find with:
static std::string const dirSeparators( "\\/" );
std::string::const_iterator pivot
= std::find_first_of( s.rbegin(), s.rend(),
dirSeparators.begin(), dirSeparators.end() ).base();

Check out boost string split.
Example:
string str1("hello abc-*-ABC-*-aBc goodbye");
typedef vector< iterator_range<string::iterator> > find_vector_type;
find_vector_type FindVec; // #1: Search for separators
ifind_all( FindVec, str1, "abc" ); // FindVec == { [abc],[ABC],[aBc] }
typedef vector< string > split_vector_type;
split_vector_type SplitVec; // #2: Search for tokens
split( SplitVec, str1, is_any_of("-*"), token_compress_on );
// SplitVec == { "hello abc","ABC","aBc goodbye" }

You can't use strtok on std::string.
strtok would modify the string, which breaks the c_str() contract.
Needing a const_cast<> is a big warning sign.

Just use the string methods:
std::string s = "home/dir/folder/name";
std::string::size_type n = s.find_last_of("/");
std::string s1 = s.substr(0, n); // the whole string if no '/' was found
// s.substr(npos) would throw std::out_of_range, so only take the tail if a '/' was found
std::string s2 = (n != std::string::npos) ? s.substr(n + 1) : std::string();
A few bits of advice:
Don't use strtok EVER
If you are playing with file system paths look at boost::filesystem
If you want to play generally with tokenization use the stream operators
Or boost::tokenizer
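If Boost is not available, C++17's std::filesystem offers a similar path type. The sketch below is an assumption-laden alternative, not what the answers above used: it relies on std::filesystem::path (C++17), and the helper name split_path is hypothetical:

```cpp
#include <filesystem>
#include <string>
#include <utility>

// Hypothetical helper: splits a path into {directory part, last component}
// using std::filesystem (C++17) rather than boost::filesystem.
std::pair<std::string, std::string> split_path(const std::string& s)
{
    const std::filesystem::path p(s);
    // generic_string() reports the parts with '/' as the separator.
    return { p.parent_path().generic_string(), p.filename().generic_string() };
}
```

For "home/dir/folder/name" this should give home/dir/folder and name. Some older toolchains need an extra link flag for std::filesystem, so treat this as a starting point rather than a drop-in replacement.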
