String replacement in C++ on string of arbitrary length

I have a string I get from ostringstream. I'm currently trying to replace some characters in this string (content.replace(content.begin(), content.end(), "\n", "");) but sometimes I get an exception:
malloc: *** mach_vm_map(size=4294955008) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
std::bad_alloc
I suspect that this happens because the string is too big. What's the best practice for these situations? Declare the string on the heap?
Update
My full method:
xml_node HTMLDocument::content() const {
xml_node html = this->doc.first_child();
xml_node body = html.child("body");
xml_node section = body.child("section");
std::ostringstream oss;
if (section.type() != xml_node_type::node_null) {
section.print(oss);
} else {
body.print(oss);
}
string content;
content = oss.str();
content.replace(content.begin(), content.end(), "<section />", "<section></section>");
content.replace(content.begin(), content.end(), "\t", "");
xml_node node;
return node;
}

std::string::replace has no overload that accepts a pair of iterators, a const char* to search for, and a const char* to use as the replacement, and this is where your problem comes from:
content.replace(content.begin(), content.end(), "\n", "");
matches the following overload:
template <class InputIterator>
string& replace(iterator i1, iterator i2,
InputIterator first, InputIterator last);
That is, "\n" and "" are treated as the iterator range [first, last). Because the two pointers refer to unrelated string literals, the behavior is undefined: depending on their addresses, the program may crash or not.
You have to either use std::regex or implement your own logic that iterates through std::string and replaces any encountered pattern with a replacement string.
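The "own logic" option can be sketched with std::string::find and the positional replace overload. This is only a minimal sketch: the name replace_all is hypothetical (it is not a standard function), and treating an empty pattern as a no-op is an assumption.

```cpp
#include <string>

// Hypothetical helper (not part of the standard library): replaces every
// occurrence of `from` in `text` with `to`, in place.
void replace_all(std::string& text, const std::string& from, const std::string& to)
{
    if (from.empty()) {
        return; // an empty pattern would match everywhere; treat it as a no-op
    }
    std::string::size_type pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        // replace(pos, len, str) overwrites exactly the range that was found
        text.replace(pos, from.size(), to);
        pos += to.size(); // continue after the inserted text so it is not re-scanned
    }
}
```

It would be called as replace_all(content, "\t", ""); for very large strings with many matches, building a new string may be cheaper than repeated in-place replacement.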

The lines:
content.replace(content.begin(), content.end(), "<section />", "<section></section>");
content.replace(content.begin(), content.end(), "\t", "");
result in undefined behavior. They match the function:
template<class InputIterator>
std::string& std::string::replace(
const_iterator i1, const_iterator i2,
InputIterator j1, InputIterator j2);
with InputIterator resolving to char const*. The problem is
that the distance between the two iterators, and whether the
second can be reached from the first, is undefined, since they
point to totally unrelated bits of memory.
From your code, I don't think you understand what
std::string::replace does. It replaces the range [i1,i2) in
the string with the text defined by the range [j1,j2). It
does not do any search and comparison; it is for use after
you have found the range which needs replacing. Calling:
content.replace(content.begin(), content.end(), "<section />", "<section></section>");
has exactly the same effect as:
content = std::string( "<section />", "<section></section>");
, which is certainly not what you want.
In C++11, there's a regex_replace function which may be of
some use, although if you're really doing this on very large
strings, it may not be the most performant (the added
flexibility of regular expressions comes at a price); I'd
probably use something like:
std::string
searchAndReplace(
std::string const& original,
std::string const& from,
std::string const& to)
{
std::string results;
std::string::const_iterator current = original.begin();
std::string::const_iterator end = original.end();
std::string::const_iterator next = std::search( current, end, from.begin(), from.end() );
while ( next != end ) {
results.append( current, next );
results.append( to );
current = next + from.size();
next = std::search( current, end, from.begin(), from.end() );
}
results.append( current, next );
return results;
}
For very large strings, some heuristic for guessing the size,
and then doing a reserve on results is probably a good idea
as well.
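One possible way to size the result, sketched here under the assumption that a second pass over the input is acceptable: count the matches first, then reserve exactly. The name searchAndReplaceReserved is hypothetical, and a cheaper heuristic (such as reserving original.size()) may be good enough in practice.

```cpp
#include <string>

// Hypothetical variant of searchAndReplace: a first pass counts the matches so
// that the result can be reserved exactly, at the cost of searching twice.
std::string searchAndReplaceReserved(
    std::string const& original,
    std::string const& from,
    std::string const& to)
{
    if (from.empty()) {
        return original; // nothing sensible to search for
    }
    std::string::size_type count = 0;
    for (std::string::size_type pos = original.find(from);
         pos != std::string::npos;
         pos = original.find(from, pos + from.size())) {
        ++count;
    }
    std::string results;
    // every match removes from.size() characters and adds to.size()
    results.reserve(original.size() - count * from.size() + count * to.size());
    std::string::size_type current = 0;
    for (std::string::size_type next = original.find(from);
         next != std::string::npos;
         next = original.find(from, current)) {
        results.append(original, current, next - current); // text before the match
        results.append(to);
        current = next + from.size();
    }
    results.append(original, current, std::string::npos); // remaining tail
    return results;
}
```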
Finally, since your second line just removes '\t', you'd be
better off using std::remove:
content.erase( std::remove( content.begin(), content.end(), '\t' ), content.end() );

AFAIK STL strings are always allocated on the heap once they exceed a certain (small) size, e.g. 15 chars in Visual Studio.
What you can do if you get allocation exceptions:
Use a custom allocator
Use a "rope" class.
Bad alloc might not mean you've run out of memory; more likely you've run out of contiguous memory. A rope class might be better suited to you, as it allocates strings in pieces internally.

This is one of the correct (and reasonably efficient) ways to remove characters from a string if you want to make a copy and leave the original intact:
#include <algorithm>
#include <string>
std::string delete_char(std::string src, char to_remove)
{
// note: src is a copy so we can mutate it
// move all offending characters to the end and get the iterator to last good char + 1
auto begin_junk = std::remove_if(src.begin(),
src.end(),
[&to_remove](const char c) { return c == to_remove; });
// chop off all the characters we wanted to remove
src.erase(begin_junk,
src.end());
// move the string back to the caller's result
return std::move(src);
}
called like this:
std::string src("a\nb\nc");
auto dest = delete_char(src, '\n');
assert(dest == "abc");
If you'd prefer to modify the string in place then simply:
src.erase(std::remove_if(src.begin(), src.end(), [](char c) { return c == '\n'; }), src.end());

Related

Regex matches under g++ 4.9 but fails under g++-5.3.1

I am tokenizing a string with a regex; this works normally under g++-4.9, but fails under g++-5.3.1.
I have the following txt file:
0001-SCAND ==> "Scandaroon" (from Philjumba)
0002-KINVIN ==> "King's Vineyard" (from Philjumba)
0003-HANNI ==> "Hannibal: Rome vs. Carthage" (from Philjumba)
0004-LOX ==> "Lords of Xidit" (from Philjumba)
which I am tokenizing using regular expressions, by spaces, quotation marks pairs and parentheses pairs. For example, the first line should be tokenized as follows:
0001-SCAND
==>
"Scandaroon"
(from Philjumba)
I have written the following std::regex:
std::regex FPAT("(\\S+)|(\"[^\"]*\")|(\\([^\\)]+\\))");
And I am tokenizing the string with:
std::vector<std::string>
split( const std::string & input, const std::regex & regex ) {
std::sregex_token_iterator
first{input.begin(), input.end(), regex, 0},
last;
return {first, last};
}
This returns the matches. Under g++-4.9 the string is tokenized as requested, but under g++-5.3.1 it's tokenized as follows:
0001-SCAND
==>
"Scandaroon"
(from
Philjumba)
or the third line is tokenized as follows:
0003-HANNI
==>
"Hannibal:
Rome
vs.
Carthage"
(from
Philjumba)
What could the issue be?
edit: I am calling the function as follows:
std::string line("0001-SCAND ==> \"Scandaroon\" (from Philjumba)");
auto elems = split( line, FPAT );
edit: following feedback from @xaxxon, I replaced returning the iterator pair with building a vector explicitly, but it's still not working correctly under g++-5.3.
std::vector<std::string>
split( const std::string & input, const std::regex & regex ) {
std::sregex_token_iterator
first{input.begin(), input.end(), regex, 0},
last;
std::vector< std::string > elems;
elems.reserve( std::distance(first,last) );
for ( auto it = first; it != last; ++ it ) {
//std::cout << (*it) << std::endl;
elems.push_back( *it );
}
return elems;
}
Regular expression alternation is eager: in the default (ECMAScript) grammar the first alternative that matches wins, even if a later one would match more text.
So for the regular expression "Set|SetValue" and the text "SetValue", the regex finds "Set".
You have to choose the order carefully:
std::regex FPAT(R"(("[^"]*")|(\([^)]+\))|(\S+))");
with \S+ at the end so that it is considered last.
Another alternative is to not use the default option (see http://en.cppreference.com/w/cpp/regex/syntax_option_type)
and use std::regex::extended instead:
std::regex FPAT(R"((\S+)|("[^"]*")|(\([^)]+\)))", std::regex::extended);
So it seems that g++-5.3.1 has fixed a bug since g++-4.9 in this regard.
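As a minimal sketch of the reordered pattern in use (the function name tokenize and the custom raw-string delimiter re are my own choices, and the default ECMAScript grammar is assumed):

```cpp
#include <regex>
#include <string>
#include <vector>

// Alternatives ordered from most to least specific: quoted text, then a
// parenthesised group, then any run of non-space characters as the fallback.
// The name tokenize is hypothetical.
std::vector<std::string> tokenize(const std::string& input)
{
    static const std::regex pattern(R"re(("[^"]*")|(\([^)]+\))|(\S+))re");
    // Submatch 0 selects the whole match for every token
    std::sregex_token_iterator first(input.begin(), input.end(), pattern, 0);
    std::sregex_token_iterator last;
    return std::vector<std::string>(first, last);
}
```

For the first line of the sample file this yields the four tokens 0001-SCAND, ==>, "Scandaroon" and (from Philjumba).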
You didn't post enough for me to know for sure (you have since updated the question to show that you are calling it with an lvalue, so this answer probably doesn't apply, but I'll leave it up unless people want me to take it down), but if you're doing what I did, you forgot that the iterators point into the source string, and that string is no longer valid.
You could remove the const from input, but it's so convenient to be able to pass an rvalue there, so...
Here's what I do to avoid this: I return a unique_ptr to something that looks like the results, but I hide the actual source string along with it so the string can't go away before I'm done using it. This is likely UB, but I think it will work virtually all the time:
// Holds a regex match as well as the original source string so the matches remain valid as long as the
// caller holds on to this object - but it acts just like a std::smatch
struct MagicSmatch {
std::smatch match;
std::string data;
// constructor makes a copy of the string and associates
// the copy's lifetime with the iterators into the string (the smatch)
MagicSmatch(const std::string & data) : data(data)
{}
};
// this deleter knows about the hidden string and makes sure to delete it
// this cast is probably UB because std::smatch isn't a standard layout type
struct MagicSmatchDeleter {
void operator()(std::smatch * smatch) {
delete reinterpret_cast<MagicSmatch *>(smatch);
}
};
// the caller just thinks they're getting a smatch ptr.. but we know the secret
std::unique_ptr<std::smatch, MagicSmatchDeleter> regexer(const std::regex & regex, const std::string & source)
{
auto magic_smatch = new MagicSmatch(source);
std::regex_search(magic_smatch->data, magic_smatch->match, regex);
return std::unique_ptr<std::smatch, MagicSmatchDeleter>(reinterpret_cast<std::smatch *>(magic_smatch));
}
as long as you call it as auto results = regexer(....) then it's quite easy to use, though results is a pointer, not a proper smatch, so the [] syntax doesn't work as nicely.
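A possible alternative that avoids the reinterpret_cast, sketched under the assumption that callers can live with writing result->match instead of receiving a bare std::smatch pointer. The names OwnedMatch and search_owned are hypothetical.

```cpp
#include <memory>
#include <regex>
#include <string>
#include <utility>

// Hypothetical holder: keeps the searched text alive next to the match.
// Copying is disabled because the iterators inside `match` point into `data`
// and would not follow a copied string.
struct OwnedMatch {
    std::string data;   // private copy of the searched text
    std::smatch match;  // sub-matches refer to positions inside `data`

    explicit OwnedMatch(std::string text) : data(std::move(text)) {}

    OwnedMatch(const OwnedMatch&) = delete;
    OwnedMatch& operator=(const OwnedMatch&) = delete;
};

// Accepts lvalues and rvalues alike; the string is moved into the holder
// before the search runs, so the match never refers to a temporary.
std::unique_ptr<OwnedMatch> search_owned(const std::regex& regex, std::string source)
{
    std::unique_ptr<OwnedMatch> result(new OwnedMatch(std::move(source)));
    std::regex_search(result->data, result->match, regex);
    return result;
}
```

The holder lives on the heap and is never copied or moved itself, so the string buffer stays where the match expects it for as long as the caller keeps the unique_ptr.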

Standard algorithm for accumulating a container into a string with a delimiter separating the entries?

I am looking for a standard library equivalent of this code for accumulating elements of an std container into a string with a delimiter separating consecutive entries:
string accumulate_with_delimiter( vector<string> strvect, string delimiter )
{
string answer;
for( vector<string>::const_iterator it = strvect.begin(); it != strvect.end(); ++it )
{
answer += *it;
if( it + 1 != strvect.end() )
{
answer += delimiter;
}
}
return answer;
}
Such code seems to be very common: printing out an array with delimiter " ", or saving into a CSV file with delimiter ",", etc. Therefore it's likely that a piece of code like that made its way into a standard library. std::accumulate comes close, but doesn't have a delimiter.
I don't think the standard C++ library has a nice approach to delimiting sequences. I typically end up using something like
std::ostringstream out;
if (!container.empty()) {
auto end(container.end());
std::copy(container.begin(), --end, std::ostream_iterator<T>(out, ", "));
out << *end;
}
Using std::accumulate() has a similar problem, although with the first element rather than the last one. With a custom add function, you could use it like this:
std::string concat;
if (!container.empty()) {
auto begin(container.begin());
concat = std::accumulate(++begin, container.end(), container.front(),
[](std::string f, std::string s) { return f + ", " + s; });
}
In both cases the iterators need to be moved to another element. The code copies the iterator into a named variable before moving it, because the container may use raw pointers as iterators, in which case a pre-increment or pre-decrement cannot be applied directly to the result of begin() or end().
std::accumulate might be the correct answer, but you need the version which takes a custom adder. You can then provide your own lambda.
Remember to pass front() as the first value to accumulate, and start adding at begin() + 1. And test for empty vectors first of course.
I'm not sure if there is one in the recent Standard Library or not, but there is always boost::algorithm::join(strvec, delimiter).

Remove first and last instance of a character from a string?

I assume the following doesn't compile because I am mixing forward and reverse iterators. Why can't I mix them like this? How can I get it to work? I want to remove the first and last quote of the string, but leave any internal quotes present.
temp.assign(find(value.begin(), value.end(), '\"'), find(value.rbegin(), value.rend(), '\"'));
I cannot even do this. What is the point of reverse iterators?
value.erase(find(value.begin(), value.end(), '\"'));
value.erase(find(value.rbegin(), value.rend(), '\"'));
The assign function (regardless of the type of temp) requires
two iterators of the same type. A reverse iterator doesn't have
the same type as a normal iterator. You can get at the
underlying normal iterator using the base() function on the
reverse iterator, but be careful: it points one element past the
one the reverse iterator refers to. For example, if you write
temp.assign( find( value.begin(), value.end(), '\"' ),
find( value.rbegin(), value.rend(), '\"').base() );
, the trailing '"' will be part of the resulting string.
This particular behavior is often what you want when you're
using the results as a beginning iterator:
std::string( std::find( fn.rbegin(), fn.rend(), '.' ).base(), fn.end() )
, for example, will give you all of the text after the last
'.'. When using the results of a find with reverse iterators
as the end criterion, you'll usually need to save it in
a variable, and "correct" it in some way.
Finally, you should be extremely cautious about using the
results of two finds to define a range, like you do above. If
there's no '"' in your text, for example, you'll end up with
the equivalent of:
temp.assign( value.end(), value.begin() );
, which isn't going to work.
EDIT:
As an example, if you don't want the '"' characters, I think the
following should work:
// Returns an empty string if there are not two " chars.
std::string
extractQuoted( std::string const& original )
{
// end points to one after the last '"', or begin() if no '"'.
std::string::const_iterator end
= std::find( original.rbegin(), original.rend(), '"' ).base();
if ( end != original.begin() ) {
-- end; // Since we don't want the '"' in the final string
}
// start points to the first '"', or end if there is no earlier '"'.
std::string::const_iterator start
= std::find( original.begin(), end, '"' );
if ( start != end ) {
++ start; // Skip the opening '"' as well
}
return std::string( start, end );
}
(It's off the top of my head, so no guarantees, but it should
get you started in the right direction.)
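If you want to use reverse_iterators, call .base() on them to get the underlying iterator type. Note that base() points one past the element that was found, so step back by one before erasing, e.g.
value.erase(find(value.begin(), value.end(), '\"'));
value.erase(std::prev(find(value.rbegin(), value.rend(), '\"').base()));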
might do the trick.
You can also use
value.erase(0, value.find('\"') + 1);
value.erase(value.rfind('\"'), value.size());
This assumes the string always contains at least two " characters.
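A sketch that drops that assumption and removes only the two quote characters themselves, leaving all other text in place. The name erase_first_and_last is hypothetical, and removing a lone quote when only one is present is my choice, not something the question specifies.

```cpp
#include <string>

// Hypothetical helper: erases the first and the last occurrence of `ch`,
// leaving everything else (including inner occurrences) untouched.
// If `ch` occurs only once, that single occurrence is erased.
void erase_first_and_last(std::string& s, char ch)
{
    const std::string::size_type first = s.find(ch);
    if (first == std::string::npos) {
        return; // no occurrence at all, nothing to do
    }
    const std::string::size_type last = s.rfind(ch);
    s.erase(last, 1); // erase the later one first so `first` stays valid
    if (first != last) {
        s.erase(first, 1);
    }
}
```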
I think the problem is that erase expects a forward iterator, not a reverse iterator. As others have pointed out you can just use .base() on the result of find to convert them.
You could also use find_end instead.
How about this?
void stripLeadingAndTrailingQuotes(std::string& s) {
// Check for emptiness first: length() - 1 would wrap around on an empty string
if (!s.empty() && s[0] == '\"')
s.erase(0, 1);
if (!s.empty() && s[s.length() - 1] == '\"')
s.erase(s.length() - 1, 1);
}
Iterators are valid but I think the above is a lot more readable.

Replace multiple spaces with one space in a string

How would I do something in c++ similar to the following code:
//Lang: Java
string.replaceAll(" ", " ");
This code-snippet would replace all multiple spaces in a string with a single space.
bool BothAreSpaces(char lhs, char rhs) { return (lhs == rhs) && (lhs == ' '); }
std::string::iterator new_end = std::unique(str.begin(), str.end(), BothAreSpaces);
str.erase(new_end, str.end());
How this works: std::unique has two forms. The first form goes through a range and removes adjacent duplicates, so the string "abbaaabbbb" becomes "abab". The second form, which I used, takes a predicate that receives two elements and returns true if they should be considered duplicates. The function I wrote, BothAreSpaces, serves this purpose. It determines exactly what its name implies: that both of its parameters are spaces. So when combined with std::unique, duplicate adjacent spaces are removed.
Just like std::remove and remove_if, std::unique doesn't actually make the container smaller, it just moves elements at the end closer to the beginning. It returns an iterator to the new end of range so you can use that to call the erase function, which is a member function of the string class.
Breaking it down, the erase function takes two parameters, a begin and an end iterator for the range to erase. For its first parameter I'm passing the return value of std::unique, because that's where I want to start erasing. For its second parameter, I am passing the string's end iterator.
So, I tried an approach with std::remove_if and a lambda expression. To my eyes it is still easier to follow than the code above, though it doesn't have that "wow neat, didn't realize you could do that" quality. I'm posting it anyway, if only for learning purposes:
bool prev(false);
char rem(' ');
auto iter = std::remove_if(str.begin(), str.end(), [&] (char c) -> bool {
if (c == rem && prev) {
return true;
}
prev = (c == rem);
return false;
});
str.erase(iter, str.end());
EDIT realized that std::remove_if returns an iterator which can be used.. removed unnecessary code.
A variant of Benjamin Lindley's answer that uses a lambda expression to make things cleaner:
std::string::iterator new_end =
std::unique(str.begin(), str.end(),
[=](char lhs, char rhs){ return (lhs == rhs) && (lhs == ' '); }
);
str.erase(new_end, str.end());
Why not use a regular expression:
str = boost::regex_replace(str, boost::regex(" {2,}"), " ");
how about isspace(lhs) && isspace(rhs) to handle all types of whitespace
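That suggestion might look like the following sketch. The names both_are_whitespace and collapse_whitespace are hypothetical; note that each run collapses to whichever whitespace character came first in it (a tab stays a tab), which may or may not be what you want.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// True when both neighbours are whitespace; the casts avoid undefined
// behaviour when char is signed and holds a negative value.
bool both_are_whitespace(char lhs, char rhs)
{
    return std::isspace(static_cast<unsigned char>(lhs)) &&
           std::isspace(static_cast<unsigned char>(rhs));
}

// Hypothetical helper: collapses each run of whitespace to its first character.
void collapse_whitespace(std::string& str)
{
    str.erase(std::unique(str.begin(), str.end(), both_are_whitespace), str.end());
}
```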

C++ removing punctuation on strings, erase()/iterator issue

I know I'm not the first person to bring up the issue with reverse iterators trying to call the erase() method on strings. However, I wasn't able to find any good ways around this.
I'm reading the contents of a file, which contains a bunch of words. When I read in a word, I want to pass it to a function I have called stripPunct. However, I ONLY want to strip punctuation at the beginning and end of a string, not in the middle.
So for instance:
(word) should strip '(' and ')' resulting in just word
don't! should strip '!' resulting in just don't
So my logic (which I'm sure could be improved) was to have two while loops, one starting at the end and one at the beginning, traversing and erasing until it hits a non-punctuation char.
void stripPunct(string & str) {
string::iterator itr1 = str.begin();
string::reverse_iterator itr2 = str.rbegin();
while ( ispunct(*itr1) ) {
str.erase(itr1);
itr1++;
}
while ( ispunct(*itr2) ) {
str.erase(itr2);
itr2--;
}
}
However, obviously it's not working because erase() requires a regular iterator and not a reverse_iterator. But anyways, I feel like that logic is pretty inefficient.
Also, I tried instead of a reverse_iterator using just a regular iterator, starting it at str.end(), then decremented it, but it says I cannot dereference the iterator if I start it at str.end().
Can anyone help me with a good way to do this? Or maybe point out a workaround for what I already have?
Thank you so much in advance!
------------------ [ EDIT ] ----------------------------
found a solution, although it may not be the best solution:
// Call the stripPunct method:
stripPunct(str);
if ( !str.empty() ) { // make sure string is still valid
// perform other code
}
And here is the stripPunct method:
void stripPunct(string & str) {
string::iterator itr1 = str.begin();
string::iterator itr2 = str.end();
while ( !(str.empty()) && ispunct(*itr1) )
itr1 = str.erase(itr1);
itr2--;
if ( itr2 != str.begin() ) {
while ( !(str.empty()) && ispunct(*itr2) ) {
itr2 = str.erase(itr2);
itr2--;
}
}
}
First, note a couple problems with your code:
after you call erase() using itr1, you've invalidated itr2.
when using a reverse_iterator to go backwards through a sequence, you want to use ++, not -- (that's kind of the reason reverse iterators exist).
Now, to improve the logic, you can avoid erasing each character individually by finding the first character you don't want to erase and erasing everything up to that point. find_if() can be used to help with that:
int not_punct(char c) {
return !ispunct((unsigned char) c);
}
void stripPunct(string & str) {
string::iterator itr = find_if( str.begin(), str.end(), not_punct);
str.erase( str.begin(), itr);
string::reverse_iterator ritr = find_if( str.rbegin(), str.rend(), not_punct);
str.erase( ritr.base(), str.end());
}
Note that I've used base() to get the 'regular' iterator corresponding to the reverse_iterator. I find the logic for whether base() needs to be adjusted confusing (reverse iterators in general confuse me)- in this case it doesn't because we happen to want to start the erase after the character that's found.
This article by Scott Meyers, http://drdobbs.com/cpp/184401406, has a good treatment of reverse_iterator::base() in the section "Guideline 3: Understand How to Use a reverse_iterator's Base iterator". The information in that article has also been incorporated into Meyers' "Effective STL" book.
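The off-by-one relationship between a reverse_iterator and its base() can be seen in a small sketch. The helper name last_index_of is hypothetical; std::string::rfind already does this job and is only avoided here to show the iterator arithmetic.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: returns the index of the last occurrence of `ch`,
// or std::string::npos if there is none, using a reverse iterator.
std::string::size_type last_index_of(const std::string& s, char ch)
{
    std::string::const_reverse_iterator rit = std::find(s.rbegin(), s.rend(), ch);
    if (rit == s.rend()) {
        return std::string::npos;
    }
    // rit.base() points one past the element rit refers to, so step back by one
    return static_cast<std::string::size_type>((rit.base() - 1) - s.begin());
}
```

When the goal is to erase everything after the found character, as in stripPunct, base() is already the right starting point and no adjustment is needed.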
You can't dereference str.end() because it points one past the last character rather than at a valid element, so you have to decrement it first.
And one final note: if the word consists only of punctuations, your program will fail, be sure to handle that.
If you don't mind negative logic, you can do the following:
string tmp_str="";
tmp_str.reserve(str.length());
for (string::iterator itr1 = str.begin(); itr1 != str.end(); itr1++)
{
if (!ispunct((unsigned char) *itr1))
{
tmp_str.push_back(*itr1);
}
}
str = tmp_str;