How to split a string and keep the delimiters using boost::split?

I have a string like this:
std::string input("I #am going to# learn how #to use #boost# library#");
I do this:
std::vector<std::string> splitVector;
boost::split(splitVector, input, boost::is_any_of("#"));
And got this: (splitVector)
splitVector:
"I "
"am going to"
" learn how "
"to use "
"boost"
" library"
"" // **That's odd, why do I have an empty string here ?**
But need something like this:
splitVector:
"I "
"#am going to"
"# learn how "
"#to use "
"#boost"
"# library"
"#"
How can I do that? Or is there another way to do it with the Boost library?
And why do I get an empty string in splitVector?

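You cannot do this with boost::split: its internal implementation, which uses the split_iterator from boost/algorithm/string/find_iterator.hpp, discards the delimiters. The empty string at the end of splitVector appears because the input ends with a delimiter, so the token after the last # is empty.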
However you can get by with boost::tokenizer, as it has an option to keep the delimiters:
Whenever a delimiter is seen in the input sequence, the current token is finished, and a new token begins. The delimiters in dropped_delims do not show up as tokens in the output whereas the delimiters in kept_delims do show up as tokens. http://www.boost.org/doc/libs/1_55_0/libs/tokenizer/char_separator.htm
For example:
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>
int main() {
// added consecutive tokens for illustration
std::string text = "I #am going to# learn how ####to use #boost# library#";
boost::char_separator<char> sep("", "#"); // specify only the kept separators
boost::tokenizer<boost::char_separator<char>> tokens(text, sep);
for (std::string t : tokens) { std::cout << "[" << t << "]" << std::endl; }
}
/* Output:
[I ]
[#]
[am going to]
[#]
[ learn how ]
[#]
[#]
[#]
[#]
[to use ]
[#]
[boost]
[#]
[ library]
[#] */

Related

Regex (JS Notation): Select spaces not in [ [], {}, "" ] to tokenize string

So I need to tokenize a string by all spaces not between quotes, I am using regex in Javascript notation.
For example:
" Test Test " ab c " Test" "Test " "Test" "T e s t"
becomes
[" Test Test ",ab,c," Test","Test ","Test","T e s t"]
For my use case however, the solution should work in the following test setting:
https://www.regextester.com/
All Spaces not within quotes should be highlighted in the above setting. If they are highlighted in the above setting they would be parsed correctly in my program.
For more specificity, I am using Boost::Regex C++ to do the parsing as follows:
...
std::string test_string("\" Test Test \" ab c \" Test\" \"Test \" \"Test\" \"T e s t\"");
// (,|;)?\\s+ : Split on ,\s or ;\s
// (?![^\\[]*\\]) : Ignore spaces inside []
// (?![^\\{]*\\}) : Ignore spaces inside {}
// (?![^\"].*\") : Ignore spaces inside "" !!! MY ATTEMPT DOESN'T WORK !!!
//Note the below regex delimiter declaration does not include the erroneous regex.
boost::regex delimiter("(,|;\\s|\\s)+(?![^\\[]*\\])(?![^\\(]*\\))(?![^\\{]*\\})");
std::vector<std::string> string_vector;
boost::split_regex(string_vector, test_string, delimiter);
For those of you who do not use Boost::regex or C++ the above link should enable testing of viable regex for the above use case.
Thank you all for your assistance; I hope you can help me with the above problem.

I would 100% not use regular expressions for this. First off, because it's way easier to express as a PEG grammar instead. E.g.:
std::vector<std::string> tokens(std::string_view input) {
namespace x3 = boost::spirit::x3;
std::vector<std::string> r;
auto atom //
= '[' >> *~x3::char_(']') >> ']' //
| '{' >> *~x3::char_('}') >> '}' //
| '"' >> *~x3::char_('"') >> '"' //
| x3::graph;
auto token = x3::raw[*atom];
parse(input.begin(), input.end(), token % +x3::space, r);
return r;
}
This, off the bat, already performs as you intend:
int main() {
for (std::string const input : {R"(" Test Test " ab c " Test" "Test " "Test" "T e s t")"}) {
std::cout << input << "\n";
for (auto& tok : tokens(input))
std::cout << " - " << quoted(tok, '\'') << "\n";
}
}
Output:
" Test Test " ab c " Test" "Test " "Test" "T e s t"
- '" Test Test "'
- 'ab'
- 'c'
- '" Test"'
- '"Test "'
- '"Test"'
- '"T e s t"'
BONUS
Where this really makes the difference, is when you realize that you wanted to be able to handle nested constructs (e.g. "string" [ {1,2,"3,4", [true,"more [string]"], 9 }, "bye ]).
Regular expressions are notoriously bad at this. Spirit grammar rules can be recursive though. If you make your grammar description more explicit I could show you examples.

You can use multiple regexes if you are ok with that. The idea is to replace spaces inside quotes with a non-printable char (\x01), and restore them after the split:
const input = `" Test Test " ab c " Test" "Test " "Test" "T e s t"`;
let result = input
.replace(/"[^"]*"/g, m => m.replace(/ /g, '\x01')) // replace spaces inside quotes
.split(/ +/) // split on spaces
.map(s => s.replace(/\x01/g, ' ')); // restore spaces inside quotes
console.log(result);
If you have escaped quotes within a string, such as "a \"quoted\" token" you can use this regex instead:
const input = String.raw`"A \"quoted\" token" " Test Test " ab c " Test" "Test " "Test" "T e s t"`; // String.raw keeps the backslashes in the input
let result = input
.replace(/".*?[^\\]"/g, m => m.replace(/ /g, '\x01')) // replace spaces inside quotes
.split(/ +/) // split on spaces
.map(s => s.replace(/\x01/g, ' ')); // restore spaces inside quotes
console.log(result);
If you want to parse nested brackets you need a proper language parser. You can also do that with regexes however: Parsing JavaScript objects with functions as JSON
Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex

How can I find the first word in a vector of strings that matches a user given prefix?

Let's say I have a sorted vector of strings:
std::vector<std::string> Dictionary;
Dictionary.push_back("ant");
Dictionary.push_back("anti-matter");
Dictionary.push_back("matter");
Dictionary.push_back("mate");
Dictionary.push_back("animate");
Dictionary.push_back("animal");
std::sort(Dictionary.begin(), Dictionary.end());
I want to find the first word in the vector that matches a prefix, but every example I found uses a hard-coded string as the prefix. For example, I can define a boolean unary function for finding the "an" prefix:
bool find_prefix(const std::string &S) {
return S.compare(0, 2, "an") == 0; // compare returns 0 on a match
}
and use it as the predicate of the std::find_if() function to find an iterator to the first match. But how can I search for user given string as a prefix? Is it possible to use binary predicates in some way? Or build a "pseudo-unary" predicate that depends on a variable and a parameter?
Or, is there any other container and methods that I should use in this problem?
I know that there are much more efficient and elegant structures to store a dictionary for prefix search, but I'm a beginner self-learning programming, so first I'd like to learn how to use the standard containers before adventuring in more complex structures.

You can write find_prefix as a lambda. That lets you capture the string you want to search for, and use that for the comparison:
std::string word = ... // the prefix you're looking for
auto result = std::find_if(Dictionary.begin(), Dictionary.end(),
[&word](std::string const &S) {
return S.compare(0, word.length(), word) == 0;
});

Since you are sorting the vector, you should take advantage of the fact that it is sorted.
Rather than doing a linear search for a match, you can use std::lower_bound to put you close to, if not right on the entry that matches the prefix:
#include <vector>
#include <string>
#include <iostream>
#include <algorithm>
int main()
{
std::vector<std::string> Dictionary;
Dictionary.push_back("ant");
Dictionary.push_back("anti-matter");
Dictionary.push_back("matter");
Dictionary.push_back("mate");
Dictionary.push_back("animate");
Dictionary.push_back("animal");
std::sort(Dictionary.begin(), Dictionary.end());
std::vector<std::string> search_test = {"an", "b", "ma", "m", "x", "anti"};
for (auto& s : search_test)
{
auto iter = std::lower_bound(Dictionary.begin(), Dictionary.end(), s);
// see if the item returned actually is a match (lower_bound may return end())
if ( iter != Dictionary.end() && iter->compare(0, s.size(), s) == 0 )
std::cout << "The string \"" << s << "\" has a match on \"" << *iter << "\"\n";
else
std::cout << "no match for \"" << s << "\"\n";
}
}
Output:
The string "an" has a match on "animal"
no match for "b"
The string "ma" has a match on "mate"
The string "m" has a match on "mate"
no match for "x"
The string "anti" has a match on "anti-matter"
The test after the lower_bound is done to see if the string actually matches the one found by lower_bound.

Keep quotation marks in a formatted list Racket

How do you print a formatted string with quotation marks, and without the backward slashes?
For example, when I enter
(format "say ~a" "hello there!")
I want to get
" say "hello there!" "
I want the quotation marks wrapped around "hello there" as the way I typed in. However, if I format it as a string, it turns out like this:
"say \"hello there!\""
Is there a way to keep the quotation marks without having the backward slash?

Evaluating a string at the REPL uses print, and print/println write the quote " as \".
Maybe you're looking for display/displayln:
(displayln (format "say \"~a\"" "hello there!"))
; => say "hello there!"

Use ~s instead of ~a:
> (format "say ~s" "hello there!")
"say \"hello there!\""

Regular expression validation fails while egrep validates just fine

I'm trying to use regular expressions in order to validate strings so before I go any further let me explain first how the strings looks like: optional number of digits followed by an 'X' and an optional ('^' followed by one or more digits).
Here are some examples: "2X", "X", "23X^6" fit the pattern while strings like "X^", "4", "foobar", "4X^", "4X44" don't.
Now, where was I: using 'egrep' and the "^[0-9]{0,}\X(\^[0-9]{1,})$" regex I can validate those strings just fine; however, when I try this in C++ using the C++11 regex library, it fails.
Here's the code I'm using to validate those strings:
#include <iostream>
#include <regex>
#include <string>
#include <vector>
int main()
{
std::regex r("^[0-9]{0,}\\X(\\^[0-9]{1,})$",
std::regex_constants::egrep);
std::vector<std::string> challanges_ok {"2X", "X", "23X^66", "23X^6",
"3123X", "2313131X^213213123"};
std::vector<std::string> challanges_bad {"X^", "4", "asdsad", " X",
"4X44", "4X^"};
std::cout << "challanges_ok: ";
for (auto &str : challanges_ok) {
std::cout << std::regex_match(str, r) << " ";
}
std::cout << "\nchallanges_bad: ";
for (auto &str : challanges_bad) {
std::cout << std::regex_match(str, r) << " ";
}
std::cout << "\n";
return 0;
}
Am I doing something wrong or am I missing something? I'm compiling under GCC 4.7.

Your regex fails to make the '^' followed by one or more digits optional; change it to:
"^[0-9]*X(\\^[0-9]+)?$".
Also note that this page says that GCC's support of <regex> is only partial, so std::regex may not work at all for you ('partial' in this context apparently means 'broken'); have you tried Boost.Xpressive or Boost.Regex as a sanity check?

optional number of digits followed by an 'X' and an optional ('^' followed by one or more digits).
OK, the regular expression in your code doesn't match that description, for two reasons: you have an extra backslash on the X, and the '^digits' part is not optional. The regex you want is this:
^[0-9]{0,}X(\^[0-9]{1,}){0,1}$
which means your grep command should look like this (note single quotes):
egrep '^[0-9]{0,}X(\^[0-9]{1,}){0,1}$' filename
And the string you have to pass in your C++ code is this:
"^[0-9]{0,}X(\\^[0-9]{1,}){0,1}$"
If you then replace all the explicit quantifiers with their more traditional abbreviations, you get @ildjarn's answer: {0,} is *, {1,} is +, and {0,1} is ?.

Get String Between 2 Strings

How can I get a string that is between two other declared strings, for example:
String 1 = "[STRING1]"
String 2 = "[STRING2]"
Source:
"832h0ufhu0sdf4[STRING1]I need this text here[STRING2]afyh0fhdfosdfndsf"
How can I get the "I need this text here"?

Since this is homework, only clues:
Find index1 of occurrence of String1
Find index2 of occurrence of String2
Substring from index1+lengthOf(String1) (inclusive) to index2 (exclusive) is what you need
Copy this to a result buffer if necessary (don't forget to null-terminate)

Might be a good case for std::regex, which is part of C++11.
#include <iostream>
#include <string>
#include <regex>
int main()
{
using namespace std::string_literals;
auto start = "\\[STRING1\\]"s;
auto end = "\\[STRING2\\]"s;
std::regex base_regex(start + "(.*)" + end);
auto example = "832h0ufhu0sdf4[STRING1]I need this text here[STRING2]afyh0fhdfosdfndsf"s;
std::smatch base_match;
std::string matched;
if (std::regex_search(example, base_match, base_regex)) {
// The first sub_match is the whole string; the next
// sub_match is the first parenthesized expression.
if (base_match.size() == 2) {
matched = base_match[1].str();
}
}
std::cout << "example: \""<<example << "\"\n";
std::cout << "matched: \""<<matched << "\"\n";
}
Prints:
example: "832h0ufhu0sdf4[STRING1]I need this text here[STRING2]afyh0fhdfosdfndsf"
matched: "I need this text here"
The program creates two strings, start and end, that serve as my start and end matches. I then build a regular expression that looks for those and captures anything in between (including nothing). Then I use regex_search to find the matching part of the input, and set matched to the captured string.
For more info, see http://en.cppreference.com/w/cpp/regex and http://en.cppreference.com/w/cpp/regex/regex_search

Use strstr (http://www.cplusplus.com/reference/clibrary/cstring/strstr/): call it once for each of the two strings to get two pointers. Then compare them; if pointer1 < pointer2, read all the chars between the end of the first string and pointer2.
