split a string into sequences of consecutive non-whitespace characters - c++

I want to split a string into sequences of consecutive non-whitespace characters. For example, given
std::string test(" 35.3881 12.3637 39.3485");,
I want to obtain an iterator which points to "35.3881", its increment points to "12.3637" and the second increment points to "39.3485".
As in the given example, it's possible (but not guaranteed) that the string starts with an unknown number of whitespaces. Moreover, the number of whitespaces between sequences of non-whitespace characters is also unknown. Trailing whitespaces are possible too.
The following code almost solves my problem:
std::regex regex("\\s+");
std::sregex_token_iterator it(test.cbegin(), test.cend(), regex, -1);
The problem, even with the given example, is that the iterator it initially points to an empty string, which is not the behavior I want. How can we fix that?

I'd just use a normal istream_iterator:
std::istringstream test(" 35.3881 12.3637 39.3485");
std::copy(std::istream_iterator<std::string>(test),
{},
std::ostream_iterator<std::string>(std::cout, "\n"));
Result:
35.3881
12.3637
39.3485
If, as noted in the comments, you find it important to avoid copying the data, you can use an istrstream instead of an istringstream, something like this:
#include <strstream>
// ...
std::string test(" 35.3881 12.3637 39.3485");
std::istrstream buffer(test.c_str());
std::copy(std::istream_iterator<std::string>(buffer),
{},
std::ostream_iterator<std::string>(std::cout, "\n"));
Note: <strstream> and everything it contains is officially deprecated, so in theory, it could disappear in some future version of the standard. I'd eventually expect to see something based on a string_view, which would also avoid copying the data, but I'm not sure it actually exists yet (it certainly doesn't in the compiler I'm using at the moment).

Related

Properly checking for palindromes using UTF-8 strings in C++

When trying to answer a question, How to use enqueu, dequeue, push, and peek in a Palindrome?, I suggested a palindrome can be found using std::string by:
bool isPalindrome(const std::string& str)
{
return std::equal(str.begin(), str.end(), str.rbegin(), str.rend());
}
For a Unicode string, I suggested:
bool isPalindrome(const std::u8string str)
{
std::u8string rstr{str};
std::reverse(rstr.begin(), rstr.end());
return str == rstr;
}
I now think this will create problems when you have multibyte characters in the string because the byte-order of the multibyte character is also reversed. Also, some characters will be equivalent to each other in different locales. Therefore, in C++20:
how do you make the comparison robust to multibyte characters?
how do you make the comparison robust to different locales when there can be equivalency between multiple characters?
Reversing a Unicode string becomes non-trivial. Converting from UTF-8 to UTF-32/UCS-4 is a good start, but not sufficient by itself--Unicode also has combining code points, so two (or more) consecutive code points form a single resulting grapheme (the added code point(s) add(s) diacritic marking to the base character), and for things to work correctly, you need to keep these in the correct order.
So, basically instead of code points, you need to divide the input up into a series of graphemes, and reverse the order of the graphemes, not just the code points.
To deal with multiple different sequences of code points that represent the same sequence of characters, you normally want to do normalization. There are four different normalization forms. In this case, you'd probably want to use NFC or NFD (should be equivalent for this purpose). The NFKC/NFKD forms are primarily for compatibility with other character sets, which it sounds like you probably don't care about.
This can also be non-trivial though. Just for one well known example, consider the German character "ß". This is sort of equivalent to "ss", but only exists in lower-case, since it never occurs at the beginning of a word. So, there's probably room for argument about whether something like Ssaß is a palindrome or not (for the moment ignoring the minor detail that it's not actually a word). For palindromes, most people ignore letter case, so it would be--but your code in the question seems to treat case as significant, in which case it probably shouldn't be.

Using a regex_iterator on an istream

I want to be able to solve problems like this: Getting std::ifstream to handle LF, CR, and CRLF? where an istream needs to be tokenized by a complex delimiter, such that the only way to tokenize the istream is to:
Read in the istream a character at a time
Collect the characters
When a delimiter is hit return the collection as a token
Regexes are very good at tokenizing strings with complex delimiters:
string foo{ "A\nB\rC\n\r" };
vector<string> bar;
// This puts {"A", "B", "C"} into bar
transform(sregex_iterator(foo.cbegin(), foo.cend(), regex("(.*)(?:\n\r?|\r)")), sregex_iterator(), back_inserter(bar), [](const smatch& i){ return i[1].str(); });
But I can't use a regex_iterator on an istream :( My solution has been to slurp the istream and then run the regex_iterator over it, but the slurping step seems superfluous.
Is there an unholy combination of istream_iterator and regex_iterator out there somewhere, or if I want it do I have to write it myself?
This question is about code appearance:
Since we know that a regex will work 1 character at a time, this question is asking to use a library to parse the istream 1 character at a time rather than internally reading and parsing the istream 1 character at a time
Since parsing an istream 1 character at a time will still copy that one character to a temp variable (buffer) this code seeks to avoid buffering all the code internally, depending on a library instead to abstract that
C++11's regexes use ECMA-262 which does not support look aheads or look behinds: https://stackoverflow.com/a/14539500/2642059 This means that a regex could match using only an input_iterator_tag, but clearly those implemented in C++11 do not.
boost::regex_iterator on the other hand does support the boost::match_partial flag (which is not available in C++11 regex flags). boost::match_partial allows the user to slurp part of the file and run the regex over that; on a mismatch due to end of input, the regex will "hold its finger" at that position in the regex and await more being added to the buffer. You can see an example here: http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/partial_matches.html In the average case, like "A\nB\rC\n\r", this can save buffer size.
boost::match_partial has 4 drawbacks:
In the worst case, like "ABC\n" this saves the user no size and he must slurp the whole istream
If the programmer guesses a buffer size that is too large, that is, one that contains the delimiter and a significant amount more, the benefits of the reduction in buffer size are squandered
Any time the buffer size selected is too small, additional computations will be required compared to the slurping of the entire file, therefore this method excels in a delimiter-dense string
The inclusion of boost always causes bloat
Circling back to answer the question: A standard library regex_iterator cannot operate on an input_iterator_tag, so slurping the whole istream is required. A boost::regex_iterator allows the user to possibly slurp less than the whole istream. Because this is a question about code appearance though, and because boost::regex_iterator's worst case requires slurping the whole file anyway, it is not a good answer to this question.
For the best code appearance slurping the whole file and running a standard regex_iterator over it is your best bet.
I think not. istream_iterator has the input_iterator_tag tag, whereas regex_iterator expects to be initialized using bi-directional iterators (bidirectional_iterator_tag).
If your delimiter regex is complex enough that you want to avoid tokenizing the stream by hand, the best way to do this is indeed to slurp the istream.

Validate ASCII GnuPlot file with c++ regex

I have been trying to get this right, but cannot seem to make things work the way I want it to.
I have an ASCII file containing several million lines of floating point values, separated by spaces. Reading these values is straightforward using std::istream_iterator<double>, but I wanted to validate the file upfront to make sure it is really formatted the way I described. Since there is only one correct format, and gazillions of ways it can be ill-formed, I wanted to go about it using std::regex.
This is what I came up with:
std::string begln( "^" );
std::string endln( "$" );
std::string fp( "[-+]?[0-9]*.?[0-9]+([eE][-+]?[0-9]+)?." );
std::string space( "[[:space:]]{1}" );
std::regex regexp( "(" + begln + fp + space + fp + space + fp + endln + ")+" );
What I wanted to express was: A line consists of something between the beginning and end of the line, which consists of three sets of floating point values separated by a single space, and I am looking for one or more of these lines.
I would expect a valid datafile to have a single match without prefix and suffix.
But hey, since these values will go into a std::vector<std::array<double, 3>>, why don't I reuse the regex machinery and obtain the values from a match list? If the file is valid, then an absolutely trivial regex could match just individual lines, and construct a std::sregex_iterator to iterate over the lines. At this point, it is only a matter of obsession how one obtains the values from a single std::string of a line, whether using regex again or std::stringstream.
Why not? The reason why you wouldn't want this is because regexes are absolute overkill. They can match far more complex grammars, and are capable of reading in those grammars at runtime. That flexibility comes at a high price. All the possible parsers must be included. No current compiler is smart enough to see that you just used [[:space:]] as a regex. (In fact, no C++ compiler or linker knows anything about regex - that's purely a library thing).
In comparison, operator>> is overloaded and the compiler sees exactly which overloads you use at compile time. The linker is told this, and includes just the relevant code.
Furthermore, the CPU branch predictor will soon notice that operator>> almost always succeeds, which is a further speedup. Your regex code is less likely to benefit in the same way - the conditional part in [0-9]* is at least one level of indirection deeper.

Replacing instances of a given std::string with another std::string in C++

I have been looking online without success for something that does the following. I have some ugly string returned as part of a Betfair SOAP response that uses a number of different char delimiters to identify certain parts of the information. What makes it awkward is that they are not always just one character length. Specifically, I need to split a string at ':' characters, but only after I have first replaced all instances of "\\:" with my personal flag "-COLON-" (which must then be replaced again AFTER the first split).
Basically I need all portions of a string like this
"6(2.5%,11\:08)~true~5.0~1162835723938~"
to become
"6(2.5%,11-COLON-08)~true~5.0~1162835723938~"
In perl it is (from memory)
$mystring =~ s/\\:/-COLON-/g;
I have been looking for some time at the functions of std::string, specifically std::find and std::replace and I know that I can code up how to do what I need using these basic functions, but I was wondering if there was a function in the standard library (or elsewhere) that already does this?
boost::replace_all(input_string, "\\:", "-COLON-");
If you have C++11 something like this ought to do the trick:
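#include <iostream>
#include <regex>
#include <string>
int main()
{
std::string str("6(2.5%,11\\:08)~true~5.0~1162835723938~");
std::regex rx("\\\\:"); // the regex \\: matches a literal backslash followed by ':'
std::string fmt("-COLON-");
std::string result = std::regex_replace(str, rx, fmt); // returns a new string; str is unchanged
std::cout << result << '\n';
return 0;
}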
Edit: There is also an optional fourth parameter for match flags, which takes values from the std::regex_constants namespace. For example, std::regex_constants::format_first_only replaces only the first occurrence of the regular expression match with the supplied format.

Behavior of STL remove() function - only rearrange container elements?

I've read here on Stack Overflow and in other sources that the behavior of the remove function is simply re-ordering the original container so that the elements that are TO BE REMOVED are moved to the end of the container and ARE NOT deleted. They remain part of the container and the remove() function simply returns an iterator that delimits the end of the range of elements to keep.
So if you never actually trim off the portion of the container that has the values that have been 'removed', they should still be present.
But when I run the code below there are no trailing spaces after the alphanumeric characters that were not 'removed'.
#include <algorithm>
#include <iostream>
#include <string>
int main()
{
std::string test = "this is a test string with a bunch of spaces to remove";
std::remove(test.begin(), test.end(), ' ');
std::cout << test << std::endl;
return 0;
}
What is going on here? Seeing as I never call test.erase() shouldn't I have a bunch of trailing spaces on my string? Is it guaranteed that the 'removed' items will still be present after calling remove()?
PS-I'm not looking for suggestions on how to best remove spaces from a string, the above is simply an illustrative example of the remove() behavior that is confusing me.
What's left at the end of your container after a call to remove is not necessarily the elements that were removed. It's just junk. Most likely it's "whatever was in those positions before the call to remove", but you can't rely on that either. Much like an uninitialized variable, it could be anything.
For example, the string:
"Hi I am Bob!"
after a call to remove to get rid of spaces probably looks like this:
"HiIamBob!ob!"
The tail holds stale characters rather than spaces, which is why you see no trailing spaces in the output. Note that std::cout prints all size() characters of a std::string, including that tail.
You might want to look at the Boost String Algorithms library.
It's a collection of algorithms to act on strings.
boost::erase_all(test, " ");