Erase inside a std::string by std::string_view - c++

I need to find and then erase a portion of a string (a substring). string_view seems such a good idea, but I cannot make it work with string::erase:
// guaranteed to return a view into `str`
auto gimme_gimme_gimme(const std::string& str) -> std::string_view;
auto after_midnight(std::string& str)
{
auto man = gimme_gimme_gimme(str);
str.erase(man); // way too hopeful, not a chance though
str.erase(man.begin(), man.end()); // nope
str.erase(std::distance(str.begin(), man.begin()), man.size()); // nope
str.erase(std::distance(str.data(), man.data()), man.size()); // nope again
// for real???
}
Am I overthinking this? Given a std::string_view into a std::string how to erase that part of the string? Or am I misusing string_view?

The string view could indeed be empty, or it could be a view to the outside of the container. Your suggested erase overload, as well as the implementation of the function in your answer relies on a pre-condition that the string view is to the same string object.
Of course, the iterator overloads are analogous and rely on the same pre-condition. However, such a pre-condition is conventional for iterators and unconventional for string views.
I don't think that string view is an ideal way to represent the sub range in this case. Instead, I would suggest using a relative sub range based on the indices. For example:
struct sub_range {
size_t begin;
size_t count;
constexpr size_t past_end() const noexcept {
return begin + count;
}
};
It is a matter of taste whether to use end (i.e. past_end) or count for the second member, and to provide the other as a function. Regardless, there should be no confusion because the member will have a name. Using count is somewhat more conventional with indices.
Another choice is whether to use signed or unsigned indices. Signed indices can be used to represent backwards ranges. std::string interface doesn't understand such ranges however.
Example usage:
auto gimme_gimme_gimme(const std::string& str) -> sub_range;
auto after_midnight(std::string& str)
{
auto man = gimme_gimme_gimme(str);
str.erase(man.begin, man.count);
}
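The index-based approach can be sketched end to end as follows. This is only an illustration: `find_range` and `erase_first` are hypothetical names that do not appear in the question, and the finder here is assumed to simply wrap std::string::find.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Same idea as the sub_range above: a position plus a length, relative to the string.
struct sub_range {
    std::size_t begin;
    std::size_t count;
};

// Hypothetical finder: returns the range of the first occurrence of `needle`,
// or an empty range at the end of `str` when there is no match.
sub_range find_range(const std::string& str, std::string_view needle)
{
    const std::size_t pos = str.find(needle);
    if (pos == std::string::npos)
        return {str.size(), 0};
    return {pos, needle.size()};
}

// Erases the first occurrence of `needle`; a no-op when it is absent.
void erase_first(std::string& str, std::string_view needle)
{
    const sub_range r = find_range(str, needle);
    str.erase(r.begin, r.count); // erase(size(), 0) is valid and removes nothing
}
```

Because the range is stored as indices, it cannot dangle the way a view can, although it still becomes stale if the string is modified between finding and erasing.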

Am I overthinking this?
You're under thinking it, unless I'm missing something obvious. To make the code compile you need this:
auto gimme_gimme_gimme(const std::string& str) -> std::string_view;
auto after_midnight(std::string& str)
{
auto man = gimme_gimme_gimme(str);
str.erase(std::distance(std::as_const(str).data(), man.data()), man.size()); // urrr... growling in pain
}
But wait!! There's more! Notice I said "to make it compile". The code is error prone!! Because...
std::string::data cannot return nullptr, but an empty string_view can be represented either as (valid pointer inside the string + size 0) or as (nullptr + size 0). The problem arises if string_view::data is nullptr, because std::distance is then computed between two unrelated pointers.
So you need to make sure that the string_view always points inside the string, even if the view is empty. Or do extra checks on the erase side.
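The "extra checks on the erase side" could look like the sketch below. `erase_view` is a hypothetical helper, not a standard function. It assumes a flat address space in which the total order given by std::less on pointers reflects actual addresses, which holds on common platforms but is not spelled out by the standard.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>

// Hypothetical helper: erases the part of `str` that `view` refers to.
// Returns false and leaves `str` untouched when `view` is null or does not
// lie inside `str`.
bool erase_view(std::string& str, std::string_view view)
{
    const char* const first = str.data();
    const char* const last = first + str.size();
    const std::less<const char*> before{}; // total order, usable on unrelated pointers

    // Reject a null view as well as a view that starts outside [first, last].
    if (view.data() == nullptr || before(view.data(), first) || before(last, view.data()))
        return false;

    const auto pos = static_cast<std::size_t>(view.data() - first);
    if (view.size() > str.size() - pos)
        return false; // the view would run past the end of the string

    str.erase(pos, view.size());
    return true;
}
```

An empty view that points inside the string is accepted and erases nothing, while a default-constructed (null) view is reported as a failure instead of feeding a garbage offset to erase.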

Related

Why does `std::string::find()` not return the end iterator on failures?

I find the behaviour of std::string::find to be inconsistent with standard C++ containers.
E.g.
std::map<int, int> myMap = {{1, 2}};
auto it = myMap.find(10); // it == myMap.end()
But for a string,
std::string myStr = "hello";
auto it = myStr.find('!'); // it == std::string::npos
Why shouldn't the failed myStr.find('!') return myStr.end() instead of std::string::npos?
Since the std::string is somewhat special when compared with other containers, I am wondering whether there is some real reason behind this.
(Surprisingly, I couldn't find anyone questioning this anywhere).
To begin with, the std::string interface is well known to be bloated and inconsistent, see Herb Sutter's Gotw84 on this topic. But nevertheless, there is a reasoning behind std::string::find returning an index: std::string::substr. This convenience member function operates on indices, e.g.
const std::string src = "abcdefghijk";
std::cout << src.substr(2, 5) << "\n";
You could implement substr such that it accepts iterators into the string, but then we wouldn't need to wait long for loud complaints that std::string is unusable and counterintuitive. So given that std::string::substr accepts indices, how would you find the index of the first occurrence of 'd' in the above input string in order to print out everything starting from this substring?
const auto it = src.find('d'); // imagine this returns an iterator
std::cout << src.substr(std::distance(src.cbegin(), it));
This might also not be what you want. Hence we can let std::string::find return an index, and here we are:
const std::string extracted = src.substr(src.find('d'));
If you want to work with iterators, use <algorithm>. It allows you to do the above as
auto it = std::find(src.cbegin(), src.cend(), 'd');
std::copy(it, src.cend(), std::ostream_iterator<char>(std::cout));
This is because std::string has two interfaces:
The general iterator based interface found on all containers
The std::string specific index based interface
std::string::find is part of the index based interface, and therefore returns indices.
Use std::find to use the general iterator based interface.
Use std::vector<char> if you don't want the index based interface (don't do this).
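To put the two interfaces side by side, here is a small sketch that extracts everything from the first occurrence of a character onward, once with indices and once with iterators. The function names `tail_by_index` and `tail_by_iterator` are made up for this illustration.

```cpp
#include <algorithm>
#include <string>

// Index-based interface: find() returns a position, substr() consumes one.
std::string tail_by_index(const std::string& src, char c)
{
    const std::string::size_type pos = src.find(c);
    if (pos == std::string::npos)
        return {};              // substr(npos) would throw std::out_of_range
    return src.substr(pos);
}

// Iterator-based interface: std::find() returns an iterator, end() on failure.
std::string tail_by_iterator(const std::string& src, char c)
{
    const auto it = std::find(src.cbegin(), src.cend(), c);
    return std::string(it, src.cend()); // empty when it == cend()
}
```

Both functions return the same result; the difference is only which failure value (npos or the end iterator) has to be handled.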

Why does std::set not have a "contains" member function?

I'm heavily using std::set<int> and often I simply need to check if such a set contains a number or not.
I'd find it natural to write:
if (myset.contains(number))
...
But because of the lack of a contains member, I need to write the cumbersome:
if (myset.find(number) != myset.end())
..
or the not as obvious:
if (myset.count(element) > 0)
..
Is there a reason for this design decision ?
I think it was probably because they were trying to make std::set and std::multiset as similar as possible. (And obviously count has a perfectly sensible meaning for std::multiset.)
Personally I think this was a mistake.
It doesn't look quite so bad if you pretend that count is just a misspelling of contains and write the test as:
if (myset.count(element))
...
It's still a shame though.
To be able to write if (s.contains()), contains() has to return a bool (or a type convertible to bool, which is another story), like binary_search does.
The fundamental reason behind the design decision not to do it this way is that contains() which returns a bool would lose valuable information about where the element is in the collection. find() preserves and returns that information in the form of an iterator, therefore is a better choice for a generic library like STL. This has always been the guiding principle for Alex Stepanov, as he has often explained (for example, here).
As to the count() approach in general, although it's often an okay workaround, the problem with it is that it does more work than a contains() would have to do.
That is not to say that a bool contains() isn't very nice to have, or even necessary. A while ago we had a long discussion about this very same issue in the ISO C++ Standard - Future Proposals group.
It lacks it because nobody added it. Nobody added it because the containers from the STL that the std library incorporated were designed to be minimal in interface. (Note that std::string did not come from the STL in the same way).
If you don't mind some strange syntax, you can fake it:
template<class K>
struct contains_t {
K&& k;
template<class C>
friend bool operator->*( C&& c, contains_t&& self ) {
auto range = std::forward<C>(c).equal_range(std::forward<K>(self.k));
return range.first != range.second;
// faster than:
// return std::forward<C>(c).count( std::forward<K>(k) ) != 0;
// for multi-meows with lots of duplicates
}
};
template<class K>
contains_t<K> contains( K&& k ) {
return {std::forward<K>(k)};
}
use:
if (some_set->*contains(some_element)) {
}
Basically, you can write extension methods for most C++ std types using this technique.
It makes a lot more sense to just do this:
if (some_set.count(some_element)) {
}
but I am amused by the extension method method.
The really sad thing is that writing an efficient contains could be faster on a multimap or multiset, as they just have to find one element, while count has to find each of them and count them.
A multiset containing 1 billion copies of 7 (you know, in case you run out) can have a really slow .count(7), but could have a very fast contains(7).
With the above extension method, we could make it faster for this case by using lower_bound, comparing to end, and then comparing to the element. Doing that for an unordered meow as well as an ordered meow would require fancy SFINAE or container-specific overloads however.
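The lower_bound idea can be sketched as a free function. This is only an illustration for std::set and std::multiset (where the element is the key); `contains_ordered` is a hypothetical name, and maps or unordered containers would need different code.

```cpp
#include <set>

// Hypothetical helper for ordered sets: a single lower_bound lookup,
// regardless of how many duplicates of `key` a multiset holds.
template <class Container, class Key>
bool contains_ordered(const Container& c, const Key& key)
{
    const auto it = c.lower_bound(key);
    // lower_bound yields the first element that is not less than key;
    // key is present exactly when key is also not less than that element.
    return it != c.end() && !c.key_comp()(key, *it);
}
```

Usage would look like `contains_ordered(my_multiset, 7)`, which stops at the first 7 instead of counting all of them.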
You are looking at a particular case and not seeing the bigger picture. As stated in the documentation, std::set meets the requirements of the AssociativeContainer concept. For that concept a contains method does not make much sense, as it is pretty much useless for std::multiset and std::multimap, whereas count works fine for all of them. A contains method could have been added as an alias for count for std::set, std::map and their hashed versions (like length for size() in std::string), but it looks like the library creators did not see a real need for it.
Although I don't know why std::set has no contains but count which only ever returns 0 or 1,
you can write a templated contains helper function like this:
template<class Container, class T>
auto contains(const Container& v, const T& x)
-> decltype(v.find(x) != v.end())
{
return v.find(x) != v.end();
}
And use it like this:
if (contains(myset, element)) ...
The true reason for set is a mystery for me, but one possible explanation for this same design in map could be to prevent people from writing inefficient code by accident:
if (myMap.contains("Meaning of universe"))
{
myMap["Meaning of universe"] = 42;
}
Which would result in two map lookups.
Instead, you are forced to get an iterator. This gives you a mental hint that you should reuse the iterator:
auto position = myMap.find("Meaning of universe");
if (position != myMap.cend())
{
position->second = 42;
}
which consumes only one map lookup.
When we realize that set and map are made from the same flesh, we can apply this principle also to set. That is, if we want to act on an item in the set only if it is present in the set, this design can prevent us from writing code as this:
struct Dog
{
std::string name;
void bark();
};
bool operator <(Dog left, Dog right)
{
return left.name < right.name;
}
std::set<Dog> dogs;
...
if (dogs.contains("Husky"))
{
dogs.find("Husky")->bark();
}
Of course all this is a mere speculation.
Since c++20,
bool contains( const Key& key ) const
is available.
I'd like to point out, as mentioned by Andy, that C++20 added a contains member function to maps and sets:
bool contains( const Key& key ) const; (since C++20)
Now I'd like to focus my answer on performance vs. readability.
In terms of performance, compare the two versions:
#include <unordered_map>
#include <string>
using hash_map = std::unordered_map<std::string,std::string>;
hash_map a;
std::string get_cpp20(hash_map& x,std::string str)
{
if(x.contains(str))
return x.at(str);
else
return "";
};
std::string get_cpp17(hash_map& x,std::string str)
{
if(const auto it = x.find(str); it !=x.end())
return it->second;
else
return "";
};
You will find that the cpp20 version takes two calls to std::_Hash_find_last_result while the cpp17 takes only one call.
Now I find myself with many data structures built from nested unordered_maps.
So you end up with something like this:
using my_nested_map = std::unordered_map<std::string,std::unordered_map<std::string,std::unordered_map<int,std::string>>>;
std::string get_cpp20_nested(my_nested_map& x,std::string level1,std::string level2,int level3)
{
if(x.contains(level1) &&
x.at(level1).contains(level2) &&
x.at(level1).at(level2).contains(level3))
return x.at(level1).at(level2).at(level3);
else
return "";
};
std::string get_cpp17_nested(my_nested_map& x,std::string level1,std::string level2,int level3)
{
if(const auto it_level1=x.find(level1); it_level1!=x.end())
if(const auto it_level2=it_level1->second.find(level2);it_level2!=it_level1->second.end())
if(const auto it_level3=it_level2->second.find(level3);it_level3!=it_level2->second.end())
return it_level3->second;
return "";
};
Now, if you have plenty of conditions in between these ifs, using the iterators becomes painful, error-prone and unclear. I often find myself looking back at the definition of the map to understand what kind of object lives at level 1 or level 2, while with the cpp20 version you see at(level1).at(level2)... and understand immediately what you are dealing with.
So in terms of code maintenance and review, contains is a very nice addition.
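A possible middle ground, sketched here under the assumption that a small helper is acceptable, keeps one lookup per level while avoiding the iterator comparisons. `find_ptr` and `get_nested` are hypothetical names, not part of the standard library.

```cpp
#include <string>
#include <unordered_map>

using my_nested_map = std::unordered_map<std::string,
    std::unordered_map<std::string, std::unordered_map<int, std::string>>>;

// Hypothetical helper: returns a pointer to the mapped value, or nullptr
// when the key is absent. Performs exactly one lookup.
template <class Map, class Key>
auto find_ptr(Map& m, const Key& key) -> decltype(&m.find(key)->second)
{
    const auto it = m.find(key);
    return it == m.end() ? nullptr : &it->second;
}

std::string get_nested(const my_nested_map& x, const std::string& level1,
                       const std::string& level2, int level3)
{
    if (const auto* l1 = find_ptr(x, level1))            // one lookup per level
        if (const auto* l2 = find_ptr(*l1, level2))
            if (const auto* l3 = find_ptr(*l2, level3))
                return *l3;
    return "";
}
```

Whether this reads better than the chained contains/at version is a matter of taste; the sketch only shows that readability and a single lookup per level are not mutually exclusive.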
What about binary_search ?
set <int> set1;
set1.insert(10);
set1.insert(40);
set1.insert(30);
if(std::binary_search(set1.begin(),set1.end(),30))
bool found=true;
contains() has to return a bool. Using C++ 20 compiler I get the following output for the code:
#include<iostream>
#include<map>
using namespace std;
int main()
{
multimap<char,int>mulmap;
mulmap.insert(make_pair('a', 1)); //multiple similar key
mulmap.insert(make_pair('a', 2)); //multiple similar key
mulmap.insert(make_pair('a', 3)); //multiple similar key
mulmap.insert(make_pair('b', 3));
mulmap.insert({'a',4});
mulmap.insert(pair<char,int>('a', 4));
cout<<mulmap.contains('c')<<endl; //Output:0 as it doesn't exist
cout<<mulmap.contains('b')<<endl; //Output:1 as it exist
}
Another reason is that it would give a programmer the false impression that std::set is a set in the math set theory sense. If they implement that, then many other questions would follow: if an std::set has contains() for a value, why doesn't it have it for another set? Where are union(), intersection() and other set operations and predicates?
The answer is, of course, that some of the set operations are already implemented as functions in <algorithm> (std::set_union() etc.) and others are as trivially implemented as contains(). Functions and function objects work better with math abstractions than object members, and they are not limited to a particular container type.
If one needs to implement full math-set functionality, one has not only a choice of underlying container but also a choice of implementation details: e.g., would the theory_union() function work with immutable objects, better suited for functional programming, or would it modify its operands and save memory? Would it be implemented as a function object from the start, or would it be better to implement it as a C function and use std::function<> if needed?
As it is now, std::set is just a container, well-suited for the implementation of set in math sense, but it is nearly as far from being a theoretical set as std::vector from being a theoretical vector.
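As an illustration of set operations living in free functions rather than members, the following sketch builds a union and a subset test on top of std::set_union and std::includes. The wrapper names `union_of` and `is_subset` are hypothetical, and the choice of returning a new set (rather than modifying an operand) is just one of the design options mentioned above.

```cpp
#include <algorithm>
#include <iterator>
#include <set>

// Returns a new set and leaves both operands untouched.
std::set<int> union_of(const std::set<int>& a, const std::set<int>& b)
{
    std::set<int> out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::inserter(out, out.end()));
    return out;
}

// "Contains another set": true when every element of `sub` is in `super`.
bool is_subset(const std::set<int>& sub, const std::set<int>& super)
{
    return std::includes(super.begin(), super.end(), sub.begin(), sub.end());
}
```

Both algorithms only require sorted ranges, so the same calls would also work on a sorted std::vector<int>.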

Get an iterator from a char pointer (C++)

I am challenging myself to write a Palindrome tester using only SL algorithms, iterators etc. I also want the program to work with raw strings. Below, I use the raw pointer pal in the copy_if algorithm, but instead, how could I define an iterator to go here, i.e. by using something like begin(pal) and end(pal + size)?
#include <algorithm>
#include <iterator>
#include <cctype>
#include <cstring>
#include <string>
using namespace std;
bool isPalindrome(const char* pal) {
if (!pal) { return(false); }
int size = strlen(pal);
string pal_raw;
pal_raw.reserve(size);
// Copy alphabetical chars only (no spaces, punctuations etc.) into pal_raw
copy_if(pal, pal+size, back_inserter(pal_raw),
[](char item) {return isalpha(item); }
);
// Test if palindromic, ignoring capitalisation
bool same = equal(begin(pal_raw), end(pal_raw), rbegin(pal_raw), rend(pal_raw),
[](char item1, char item2) {return tolower(item1) == tolower(item2); }
);
return same;
}
int main(){
char pal[] = "Straw? No, too stupid a fad. I put soot on warts.";
bool same = isPalindrome(pal);
return 0;
}
Bonus Question: Is it possible to eliminate the need to copy_if() by incrementing the iterators 'in place' from within equal() i.e. when !isalpha(item)?
Iterators implement the concept of pointers, when it comes to C++ library algorithms. And, as you've discovered, C++ library algorithms that take iterators are perfectly happy to also take pointers. It's the same concept.
And when you already have pointers to begin with, there is no other kind of iterator that they need to be converted to.
It is true that
std::begin(arr)
and
std::end(arr)
are defined on flat arrays. But, guess what: they return a pointer to the beginning and the end of the array, and not an iterator class of some kind.
However, you cannot use std::begin() and std::end() here because, by the time you need them inside your function, the array has already decayed to a char *. std::begin() and std::end() work on real arrays, not on decayed pointers.
If you insist on using iterators, you should pass a std::string to your palindrome function, instead of a char *. std::string implements a begin() and an end() method that return a std::string::iterator, that you can use.
If I'd do that and I'd want to make it work for different types, I'd templatize the method - for example:
template<class ITER_T>
bool isPalindrome(ITER_T begin, ITER_T end) {
// check for palindrome generically between begin and end
}
That way it would work for const char * iterators, std::string, std::vector<char>, std::map<char> with the same code. And if implemented properly, even for other types of vectors, maps and anything else which you can get iterator for and the item type has a comparison operator defined.
As a bonus, I could then also check if a part of character array, string or a vector is a palindrome.
By the way, this:
equal(begin(pal_raw), end(pal_raw), rbegin(pal_raw), rend(pal_raw), ...
is unnecessarily checking each character twice, if the string is a palindrome ... not very efficient. In this case you could do perhaps better with a "manual" loop (not sure if there is a std algo for that).
If you'd like to make the begin(arr)/end(arr) work inside the overloaded function, you'd need to templatize it anyway, like this:
template<size_t N>
bool isPalindrome(const char (&arr)[N]) {
...
However then you get separate instantiation for each different array size. So it is better anyway to templatize using the iterators and only get a single instantiation for any char array size.
So to answer the "bonus question", it is indeed possible to avoid creating the temporary string (i.e. dynamic memory allocation) at all, by iterating over the array directly:
template<typename ITER_T>
bool isPalindrome(ITER_T begin, ITER_T end) {
while (begin < end) {
if (tolower(*begin++) != tolower(*--end))
return false;
}
return true;
}
bool same = isPalindrome(begin(pal), end(pal) - 1); // end(pal) - 1 skips the terminating '\0' of the char array
Testing for isalpha I leave to you as practice. Hint: you can do that before the equality check and increment/decrement begin/end appropriately (hint #2: the solution I have in mind uses the continue keyword).
The same goes for making it work with an arbitrary type other than char: type traits can then be used to abstract out the isalpha/tolower calls via template specializations.

Effective construction std::string from std::unordered_set<char>

I have an unordered_set of chars
std::unordered_set<char> u_setAlphabet;
Then I want to get a content from the set as std::string. My implementation now looks like this:
std::string getAlphabet() {
std::string strAlphabet;
for (const char& character : u_setAlphabet)
strAlphabet += character;
return strAlphabet;
}
Is this a good way to solve this task? Appending single chars to the string seems suboptimal for a large u_setAlphabet (multiple reallocations?). Is there any other way to do it?
The simplest, most readable and most efficient answer is:
return std::string(s.begin(), s.end());
The implementation may choose to detect the length of the range up-front and only allocate once; both libc++ and libstdc++ do this when given a forward iterator range.
The string class also offers you reserve, just like vector, to manage the capacity:
std::string result;
result.reserve(s.size());
for (unsigned char c : s) result.push_back(c); // or std::copy
return result;
It also offers assign, append and insert member functions, but since those offer the strong exception guarantee, they may have to allocate a new buffer before destroying the old one (thanks to @T.C. for pointing out this crucial detail!). The libc++ implementation does not reallocate if the existing capacity suffices, while GCC5's libstdc++ implementation reallocates unconditionally.
std::string has a constructor for that:
auto s = std::string(begin(u_setAlphabet), end(u_setAlphabet));
It is better to use the constructor that accepts iterators. For example
std::string getAlphabet() {
return { u_setAlphabet.begin(), u_setAlphabet.end() };
}
Both return std::string(u_setAlphabet.begin(), u_setAlphabet.end()); and return { u_setAlphabet.begin(), u_setAlphabet.end() }; are the same in C++11. I prefer @VladfromMoscow's solution because we do not need to make any assumption about the returned type of the temporary object.

Interpret a std::string as a std::vector of char_type?

I have a template<typename T> function that takes a const vector<T>&. In that function, I use the vector's cbegin(), cend(), size(), and operator[].
As far as I understand it, both string and vector use contiguous space, so I was wondering if I could reuse the function for both data types in an elegant manner.
Can a std::string be reinterpreted as a std::vector of (the appropriate) char_type? If so, what would the limitations be?
If you make your template just for type const T& and use the begin(), end(), etc, functions which both vector and string share then your code will work with both types.
Go STL way and use iterators. Accept iterator to begin and iterator to end. It will work with all possible containers, including non-containers like streams.
There is no guarantee the layout of string and vector will be the same. They theoretically could be, but they probably aren't in any common implementation. Therefore, you can't do this safely. See Zan's answer for a better solution.
Let me explain: If I am a standard library implementer and decide to implement std::string like so....
template ...
class basic_string {
public:
...
private:
CharT* mData;
size_t mSize;
};
and decide to implement std::vector like so...
template ...
class vector {
public:
...
private:
T* mEnd;
T* mBegin;
};
When you reinterpret_cast<string*>(&myVector), you wind up interpreting the pointer to the end of your data as the pointer to the start of your data, and the pointer to the start of your data as the size of your data. If the padding between members is different, or there are extra members, it could get even weirder and more broken than that.
So yes, in order for this to possibly work they both need to store contiguous data, but they also need quite a bit else to be the same between the implementations for it to work.
std::experimental::array_view<const char> n4512 represents a contiguous buffer of chars.
Writing your own is not hard, and it solves this problem and (in my experience) many more.
Both string and vector are compatible with an array view.
This lets you move your implementation into a .cpp file (and not expose it), gives you the same performance as doing it with std::vector<T> const& and probably the same implementation, avoids duplicating code, and uses light weight contiguous buffer type erasure (which is full of tasty keywords).
If the key point is that you want to access a continuous area in memory where instances of a specific char type are stored then you could define your function as
void myfunc(const CType *p, int size) {
...
}
to make it clear that you assume they must be adjacent in memory.
Then for example to pass the content of a vector the code is simply
myfunc(&myvect[0], myvect.size());
and for a string
myfunc(mystr.data(), mystr.size());
or
myfunc(buffer, n);
for an array.
You can't directly typecast a std::vector to a std::string or vice versa. But using the iterators that STL containers provide does allow you to iterate both a vector and a string in the same way. And if your function requires random access of the container in question then either would work.
std::vector<char> str1 {'a', 'b', 'c'};
std::string str2 = "abc";
template<typename Iterator>
void iterator_function(Iterator begin, Iterator end)
{
for(Iterator it = begin; it != end; ++it)
{
std::cout << *it << std::endl;
}
}
iterator_function(str1.begin(), str1.end());
iterator_function(str2.begin(), str2.end());
Both of those last two function calls would print the same thing.
Now if you wanted to write a version that handles only characters, whether they are stored in a string or in a vector, you could write something that iterates over the internal array.
void array_function(const char * array, unsigned length)
{
for(unsigned i = 0; i < length; ++i)
{
std::cout << array[i] << std::endl;
}
}
Both functions would do the same thing in the following scenarios.
std::vector<char> str1 {'a', 'b', 'c'};
std::string str2 = "abc";
iterator_function(str1.begin(), str1.end());
iterator_function(str2.begin(), str2.end());
array_function(str1.data(), str1.size());
array_function(str2.data(), str2.size());
There are always multiple ways to solve a problem. Depending on what you have available any number of solutions might work. Try both and see which works better for your application. If you don't know the iterator type then the char typed array iteration is useful. If you know you will always have the template type to pass in then the template iterator method might be more useful.
The way your question is put at the moment is a bit confusing. If you mean to be asking "is it safe to cast a std::vector type to a std::string type or vice versa if the vector happens to contain char values of the appropriate type?", the answer is: no way, don't even think about it! If you're asking: "can I access the contiguous memory of non-empty sequences of char type if they're of the type std::vector or std::string?" then the answer is, yes you can (with the data() member function).