In C++ STL, how do I remove non-numeric characters from std::string with regex_replace? - c++

Using the C++ Standard Template Library function regex_replace(), how do I remove non-numeric characters from a std::string and return a std::string?
This question is not a duplicate of question
747735
because that question requests how to use TR1/regex, and I'm
requesting how to use standard STL regex, and because the answer given
is merely some very complex documentation links. The C++ regex
documentation is extremely hard to understand and poorly documented,
in my opinion, so even if a question pointed out the standard C++
regex_replace
documentation,
it still wouldn't be very useful to new coders.

// assume #include <regex> and <string>
std::string sInput = R"(AA #-0233 338982-FFB /ADR1 2)";
std::string sOutput = std::regex_replace(sInput, std::regex(R"([\D])"), "");
// sOutput now contains only numbers
Note that the R"..." part means raw string literal and does not evaluate escape codes like a C or C++ string would. This is very important when doing regular expressions and makes your life easier.
Here's a handy list of single-character regular expression raw literal strings for your std::regex() to use for replacement scenarios:
R"([^A-Za-z0-9])" or R"([^A-Za-z\d])" = select non-alphabetic and non-numeric
R"([A-Za-z0-9])" or R"([A-Za-z\d])" = select alphanumeric
R"([0-9])" or R"([\d])" = select numeric
R"([^0-9])" or R"([^\d])" or R"([\D])" = select non-numeric

Regular expressions are overkill here.
#include <algorithm>
#include <iostream>
#include <iterator>
#include <string>
inline bool not_digit(char ch) {
return '0' <= ch && ch <= '9';
}
std::string remove_non_digits(const std::string& input) {
std::string result;
std::copy_if(input.begin(), input.end(),
std::back_inserter(result),
not_digit);
return result;
}
int main() {
std::string input = "1a2b3c";
std::string result = remove_non_digits(input);
std::cout << "Original: " << input << '\n';
std::cout << "Filtered: " << result << '\n';
return 0;
}

The accepted answer if fine for the specifics of the given sample.
But it will fail for a number such as "-12.34" (it would result in "1234").
(note how the sample could be negative numbers)
Then the regex should be:
(-|\+)?(\d)+(.(\d)+)*
explanation: (optional ( "-" or "+" )) with (a number, repeated 1 to n times) with (optionally end's with: ( a "." followed by (a number, repeated 1 to n times) )
A bit over-reaching, but I was looking for this and the page showed up first in my search, so I'm adding my answer for future searches.

Related

How do I make an alphabetized list of all distinct words in a file with the number of times each word was used?

I am writing a program using Microsoft Visual C++. In the program I must read in a text file and print out an alphabetized list of all distinct words in that file with the number of times each word was used.
I have looked up different ways to alphabetize a string but they do not work with the way I have my string initialized.
// What is inside my text file
Any experienced programmer engaged in writing programs for use by others knows
that, once his program is working correctly, good output is a must. Few people
really care how much time and trouble a programmer has spent in designing and
debugging a program. Most people see only the results. Often, by the time a
programmer has finished tackling a difficult problem, any output may look
great. The programmer knows what it means and how to interpret it. However,
the same cannot be said for others, or even for the programmer himself six
months hence.
string lines;
getline(input, lines); // Stores what is in file into the string
I expect an alphabetized list of words with the number of times each word was used. So far, I do not know how to begin this process.
It's rather simple, std::map automatically sorts based on key in the key/value pair you get. The key/value pair represents word/count which is what you need. You need to do some filtering for special characters and such.
EDIT: std::stringstream is a nice way of splitting std::string using whitespace delimiter as it's the default delimiter. Therefore, using stream >> word you will get whitespace-separated words. However, this might not be enough due to punctuation. For example: Often, has comma which we need to filter out. Therefore, I used std::replaceif which replaces puncts and digits with whitespaces.
Now a new problem arises. In your example, you have: "must.Few" which will be returned as one word. After replacing . with we have "must Few". So I'm using another stringstream on the filtered "word" to make sure I have only words in the final result.
In the second loop you will notice if(word == "") continue;, this can happen if the string is not trimmed. If you look at the code you will find out that we aren't trimming after replacing puncts and digits. That is, "Often," will be "Often " with trailing whitespace. The trailing whitespace causes the second loop to extract an empty word. This is why I added the condition to ignore it. You can trim the filtered result and then you wouldn't need this check.
Finally, I have added ignorecase boolean to check if you wish to ignore the case of the word or not. If you wish to do so, the program will simply convert the word to lowercase and then add it to the map. Otherwise, it will add the word the same way it found it. By default, ignorecase = true, if you wish to consider case, just call the function differently: count_words(input, false);.
Edit 2: In case you're wondering, the statement counts[word] will automatically create key/value pair in the std::map IF there isn't any key matching word. So when we call ++: if the word isn't in the map, it will create the pair, and increment value by 1 so you will have newly added word. If it exists already in the map, this will increment the existing value by 1 and hence it acts as a counter.
The program:
#include <iostream>
#include <map>
#include <sstream>
#include <cstring>
#include <cctype>
#include <string>
#include <iomanip>
#include <algorithm>
std::string to_lower(const std::string& str) {
std::string ret;
for (char c : str)
ret.push_back(tolower(c));
return ret;
}
std::map<std::string, size_t> count_words(const std::string& str, bool ignorecase = true) {
std::map<std::string, size_t> counts;
std::stringstream stream(str);
while (stream.good()) {
// wordW may have multiple words connected by special chars/digits
std::string wordW;
stream >> wordW;
// filter special chars and digits
std::replace_if(wordW.begin(), wordW.end(),
[](const char& c) { return std::ispunct(c) || std::isdigit(c); }, ' ');
// now wordW may have multiple words seperated by whitespaces, extract them
std::stringstream word_stream(wordW);
while (word_stream.good()) {
std::string word;
word_stream >> word;
// ignore empty words
if (word == "") continue;
// add to count.
ignorecase ? counts[to_lower(word)]++ : counts[word]++;
}
}
return counts;
}
void print_counts(const std::map<std::string, size_t>& counts) {
for (auto pair : counts)
std::cout << std::setw(15) << pair.first << " : " << pair.second << std::endl;
}
int main() {
std::string input = "Any experienced programmer engaged in writing programs for use by others knows \
that, once his program is working correctly, good output is a must.Few people \
really care how much time and trouble a programmer has spent in designing and \
debugging a program.Most people see only the results.Often, by the time a \
programmer has finished tackling a difficult problem, any output may look \
great.The programmer knows what it means and how to interpret it.However, \
the same cannot be said for others, or even for the programmer himself six \
months hence.";
auto counts = count_words(input);
print_counts(counts);
return 0;
}
I have tested this with Visual Studio 2017 and here is the part of the output:
a : 5
and : 3
any : 2
be : 1
by : 2
cannot : 1
care : 1
correctly : 1
debugging : 1
designing : 1
As others have already noted, an std::map handles the counting you care about quite easily.
Iostreams already have a tokenize to break an input stream up into words. In this case, we want to to only "think" of letters as characters that can make up words though. A stream uses a locale to make that sort of decision, so to change how it's done, we need to define a locale that classifies characters as we see fit.
struct alpha_only: std::ctype<char> {
alpha_only(): std::ctype<char>(get_table()) {}
static std::ctype_base::mask const* get_table() {
// everything is white space
static std::vector<std::ctype_base::mask>
rc(std::ctype<char>::table_size,std::ctype_base::space);
// except lower- and upper-case letters, which are classified accordingly:
std::fill(&rc['a'], &rc['z'], std::ctype_base::lower);
std::fill(&rc['A'], &rc['Z'], std::ctype_base::upper);
return &rc[0];
}
};
With that in place, we tell the stream to use our ctype facet, then simply read words from the file and count them in the map:
std::cin.imbue(std::locale(std::locale(), new alpha_only));
std::map<std::string, std::size_t> counts;
std::string word;
while (std::cin >> word)
++counts[to_lower(word)];
...and when we're done with that, we can print out the results:
for (auto w : counts)
std::cout << w.first << ": " << w.second << "\n";
Id probably start by inserting all of those words into an array of strings, then start with the first index of the array and compare that with all of the other indexes if you find matches, add 1 to a counter and after you went through the array you could display the word you were searching for and how many matches there were and then go onto the next element and compare that with all of the other elements in the array and display etc. Or maybe if you wanna make a parallel array of integers that holds the number of matches you could do all the comparisons at one time and the displays at one time.
EDIT:
Everyone's answer seems more elegant because of the map's inherent sorting. My answer functions more as a parser, that later sorts the tokens. Therefore my answer is only useful to the extent of a tokenizer or lexer, whereas Everyone's answer is only good for sorted data.
You first probably want to read in the text file. You want to use a streambuf iterator to read in the file(found here).
You will now have a string called content, which is the content of you file. Next you will want to iterate, or loop, over the contents of this string. To do that you'll want to use an iterator. There should be a string outside of the loop that stores the current word. You will iterate over the content string, and each time you hit a letter character, you will add that character to your current word string. Then, once you hit a space character, you will take that current word string, and push it back into the wordString vector. (Note: that means that this will ignore non-letter characters, and that only spaces denote word separation.)
Now that we have a vector of all of our words in strings, we can use std::sort, to sort the vector in alphabetical order.(Note: capitalized words take precedence over lowercase words, and therefore will be sorted first.) Then we will iterate over our vector of stringWords and convert them into Word objects (this is a little heavy-weight), that will store their appearances and the word string. We will push these Word objects into a Word vector, but if we discover a repeat word string, instead of adding it into the Word vector, we'll grab the previous entry and increment its appearance count.
Finally, once this is all done, we can iterate over our Word object vector and output the word followed by its appearances.
Full Code:
#include <vector>
#include <fstream>
#include <iostream>
#include <streambuf>
#include <algorithm>
#include <string>
class Word //define word object
{
public:
Word(){appearances = 1;}
~Word(){}
int appearances;
std::string mWord;
};
bool isLetter(const char x)
{
return((x >= 'a' && x <= 'z') || (x >= 'A' && x <= 'Z'));
}
int main()
{
std::string srcFile = "myTextFile.txt"; //what file are we reading
std::ifstream ifs(srcFile);
std::string content( (std::istreambuf_iterator<char>(ifs) ),
( std::istreambuf_iterator<char>() )); //read in the file
std::vector<std::string> wordStringV; //create a vector of word strings
std::string current = ""; //define our current word
for(auto it = content.begin(); it != content.end(); ++it) //iterate over our input
{
const char currentChar = *it; //make life easier
if(currentChar == ' ')
{
wordStringV.push_back(current);
current = "";
continue;
}
else if(isLetter(currentChar))
{
current += *it;
}
}
std::sort(wordStringV.begin(), wordStringV.end(), std::less<std::string>());
std::vector<Word> wordVector;
for(auto it = wordStringV.begin(); it != wordStringV.end(); ++it) //iterate over wordString vector
{
std::vector<Word>::iterator wordIt;
//see if the current word string has appeared before...
for(wordIt = wordVector.begin(); wordIt != wordVector.end(); ++wordIt)
{
if((*wordIt).mWord == *it)
break;
}
if(wordIt == wordVector.end()) //...if not create a new Word obj
{
Word theWord;
theWord.mWord = *it;
wordVector.push_back(theWord);
}
else //...otherwise increment the appearances.
{
++((*wordIt).appearances);
}
}
//print the words out
for(auto it = wordVector.begin(); it != wordVector.end(); ++it)
{
Word theWord = *it;
std::cout << theWord.mWord << " " << theWord.appearances << "\n";
}
return 0;
}
Side Notes
Compiled with g++ version 4.2.1 with target x86_64-apple-darwin, using the compiler flag -std=c++11.
If you don't like iterators you can instead do
for(int i = 0; i < v.size(); ++i)
{
char currentChar = vector[i];
}
It's important to note that if you are capitalization agnostic simply use std::tolower on the current += *it; statement (ie: current += std::tolower(*it);).
Also, you seem like a beginner and this answer might have been too heavyweight, but you're asking for a basic parser and that is no easy task. I recommend starting by parsing simpler strings like math equations. Maybe make a calculator app.

Replace a Field in a Space Delimited String C++

I have a single-space delimited string and I want to replace field x.
I can repeatedly use find to locate the x - 1 and x spaces, then use substr to grab the two strings on either side, then concatenate the two sub strings and my replacement text.
But man that seems like an awful lot of work for something that should be simple. Is there a better solution-- one that doesn't require Boost?
Answer
I've cleaned up #Domenic Lokies answer below:
sting fieldReplace( const string input, const string outputField, int index )
{
vector< char > stringIndex( numeric_limits< int >::digits10 + 2 );
_itoa_s( index, stringIndex.begin()._Ptr, stringIndex.size(), 10 );
const string stringRegex( "^((?:\\w+ ){" ); //^((?:\w+ ){$index})\w+
return regex_replace( input, regex( stringRegex + stringIndex.begin()._Ptr + "})\\w+" ), "$1" + outputField );
}
(_itoa_s and _Ptr are MSVS only I believe, so you'll need to clean those up if you want code portability. )
You can do it using one of the string::replace methods:
Locate the position of the x-1-st space. You can do it by calling string::find repeatedly
Locate the position of the x-th space by calling string::find one more time
Calculate the length of the word being replaced by subtracting the first index from the second one
Call string::replace passing the first index, the length, and the replacement string.
Here is how you can implement this:
#include <iostream>
#include <string>
using namespace std;
int main() {
string s = "quick brown frog jumps over the lazy dog";
size_t start = -1;
int cnt = 3; // Word number three
do {
start = s.find(' ', start+1);
} while (start != string::npos && --cnt > 1);
size_t end = s.find(' ', start+1);
s.replace(start+1, end-start-1, "fox");
cout << s << endl;
return 0;
}
Demo on ideone.
Since C++11 you should use a Regular Expression for your purposes. If you are not using a compiler which supports C++11, you can take a look at Boost.Regex.
Never combine std::string::find with std::string::replace, that is just not a good style in a language like C++.
I have written a short example for you to show you how to use Regular Expressions in C++.
#include <string>
#include <regex>
#include <iostream>
int main()
{
std::string subject = "quick brown frog jumps over the lazy dog";
std::regex pattern("frog");
std::cout << std::regex_replace(subject, pattern, "fox");
}

C++ boost/regex regex_search

Consider the following string content:
string content = "{'name':'Fantastic gloves','description':'Theese gloves will fit any time period.','current':{'trend':'high','price':'47.1000'}";
I have never used regex_search and I have been searching around for ways to use it - I still do not quite get it. From that random string (it's from an API) how could I grab two things:
1) the price - in this example it is 47.1000
2) the name - in this example Fantastic gloves
From what I have read, regex_search would be the best approach here. I plan on using the price as an integer value, I will use regex_replace in order to remove the "." from the string before converting it. I have only used regex_replace and I found it easy to work with, I don't know why I am struggling so much with regex_search.
Keynotes:
Content is contained inside ' '
Content id and value is separated by :
Conent/value are separated by ,
Value of id's name and price will vary.
My first though was to locate for instance price and then move 3 characters ahead (':') and gather everything until the next ' - however I am not sure if I am completely off-track here or not.
Any help is appreciated.
boost::regex would not be needed. Regular expressions are used for more general pattern matching, whereas your example is very specific. One way to handle your problem is to break the string up into individual tokens. Here is an example using boost::tokenizer:
#include <iostream>
#include <string>
#include <boost/tokenizer.hpp>
#include <map>
int main()
{
std::map<std::string, std::string> m;
std::string content = "{'name':'Fantastic gloves','description':'Theese gloves will fit any time period.','current':{'trend':'high','price':'47.1000'}";
boost::char_separator<char> sep("{},':");
boost::tokenizer<boost::char_separator<char>> tokenizer(content, sep);
std::string id;
for (auto tok = tokenizer.begin(); tok != tokenizer.end(); ++tok)
{
// Since "current" is a special case I added code to handle that
if (*tok != "current")
{
id = *tok++;
m[id] = *tok;
}
else
{
id = *++tok;
m[id] = *++tok; // trend
id = *++tok;
m[id] = *++tok; // price
}
}
std::cout << "Name: " << m["name"] << std::endl;
std::cout << "Price: " << m["price"] << std::endl;
}
Link to live code.
As the string you are attempting to parse appears to be JSON (JavaScript Object Notation), consider using a specialized JSON parser.
You can find a comprehensive list of JSON parsers in many languages including C++ at http://json.org/. Also, I found a discussion on the merits of several JSON parsers for C++ in response to this SO question.

Equality of two strings

What is the easiest way, with the least amount of code, to compare two strings, while ignoring the following:
"hello world" == "hello world" // spaces
"hello-world" == "hello world" // hyphens
"Hello World" == "hello worlD" // case
"St pierre" == "saint pierre" == "St. Pierre" // word replacement
I'm sure this has been done before, and there are some libraries to do this kind of stuff, but I don't know any. This is in C++ preferably, but if there's a very short option in whatever other language, I'll want to hear about it too.
Alternatively, I'd also be interested in any library that could give a percentage of matching. Say, hello-world and hello wolrd are 97% likely to be the same meaning, just a hyphen and a mispelling.
Remove spaces from both strings.
Remove hyphens from both strings.
Convert both strings to lower case.
Convert all occurrences of “saint” and “st.” to “st”.
Compare strings like normal.
For example:
#include <cctype>
#include <string>
#include <algorithm>
#include <iostream>
static void remove_spaces_and_hyphens(std::string &s)
{
s.erase(std::remove_if(s.begin(), s.end(), [](char c) {
return c == ' ' || c == '-';
}), s.end());
}
static void convert_to_lower_case(std::string &s)
{
for (auto &c : s)
c = std::tolower(c);
}
static void
replace_word(std::string &s, const std::string &from, const std::string &to)
{
size_t pos = 0;
while ((pos = s.find(from, pos)) != std::string::npos) {
s.replace(pos, from.size(), to);
pos += to.size();
}
}
static void replace_words(std::string &s)
{
replace_word(s, "saint", "st");
replace_word(s, "st.", "st");
}
int main()
{
// Given two strings:
std::string s1 = "Hello, Saint Pierre!";
std::string s2 = "hELlO,St.PiERRe!";
// Remove spaces and hyphens.
remove_spaces_and_hyphens(s1);
remove_spaces_and_hyphens(s2);
// Convert to lower case.
convert_to_lower_case(s1);
convert_to_lower_case(s2);
// Replace words...
replace_words(s1);
replace_words(s2);
// Compare.
std::cout << (s1 == s2 ? "Equal" : "Doesn't look like equal") << std::endl;
}
There is a way, of course, to code this more efficiently, but I recommend you start with something working and optimize it only when it proves to be a bottleneck.
It also sounds like you might be interested in string similarity algorithms like “Levenshtein distance”. Similar algorithms are used, for example, by search engine or editors to offer suggestion on spell correction.
I dont know any library, but for equlity, if speed is not rpoblem, you can do char-by-char compare and ignore "special" characters (respectively move iterator further in text).
As for comparing texts, you can use simple Levenshtein distance.
For spaces and hyphens, just replace all spaces/hyphens in the string and do a comparison. For case, convert all text to upper or lower case and do the comparison. For word replacement, you would need a dictionary of words with the key being the abbreviation and the value being the replacement word. You may also consider using the Levenshtein Distance algorithm for showing how similar one phrase is to another. If you want statistical probablility of how close a word/phrase is to another word/phrase, you will need sample data to do a comparison.
QRegExp is what you are looking for. It won't print out the percentages, but you can make some pretty slick ways of comparing one string to another, and finding the number of matches of one string to another.
Regular Expressions are available with almost ever language out there. I like GSkinner's RegEx page for learning regular expressions.
http://qt-project.org/doc/qt-4.8/qregexp.html
Hope that helps.
for the first 3 requirments,
remove all spaces/hypens of string (or replace it to a char, e.g'')
"hello world" --> "helloworld"
compare them ignore case.
Case insensitive string comparison in C++
for the last requirment, it is more compliate.
first you need a dictionary, which in KV structure:
'St.': 'saint'
'Mr.': 'mister'
second use boost token to seperate the string, and fetch then in the KV Store
then replace the token to the string, but it may in low performance:
http://www.boost.org/doc/libs/1_53_0/libs/tokenizer/tokenizer.htm

How would I perform this text pattern matching

Migrated from [Spirit-general] list
Good morning,
I'm trying to parse a relatively simple pattern across 4 std::strings,
extracting whatever the part which matches the pattern into a separate
std::string.
In an abstracted sense, here is what I want:
s1=<string1><consecutive number>, s2=<consecutive number><string2>,
s3=<string1><consecutive number>, s4=<consecutive number><string2>
Less abstracted:
s1="apple 1", s2="2 cheese", s3="apple 3", s4="4 cheese"
Actual contents:
s1="lxckvjlxcjvlkjlkje xvcjxzlvcj wqrej lxvcjz ljvl;x czvouzxvcu
j;ljfds apple 1 xcvljxclvjx oueroi xcvzlkjv; zjx", s2="xzljlkxvc
jlkjxzvl jxcvljzx lvjlkj wre 2 cheese", s3="apple 3", s4="kxclvj
xcvjlxk jcvljxlck jxcvl 4 cheese"
How would I perform this pattern matching?
Thanks for all suggestions,
Alec Taylor
Update 2
Here is a really simple explanation I just figured out to explain the
problem I am trying to solve:
std::string s1=garbagetext1+number1+name1+garbagetext4;
std::string s3=garbagetext2+(number1+2)+name1+garbagetext5;
std::string s5=garbagetext3+(number1+4)+name1+garbagetext6;
Edit for context:
Feel free to add it to stackoverflow (I've been having some trouble
posting there)
I can't give you what I've done so far, because I wasn't sure if it
was within the capabilities of the boost::spirit libraries to do what
I'm trying to do
Edit: Re Update2
Here is a really simple explanation I just figured out to explain the
problem I am trying to solve:
std::string s1=garbagetext1+number1+name1+garbagetext4;
std::string s3=garbagetext2+(number1+2)+name1+garbagetext5;
std::string s5=garbagetext3+(number1+4)+name1+garbagetext6;
It starts looking like a job for:
Tokenizing the 'garbage text/names' - you could make a symbol table of sorts on the fly and use it to match patterns (spirit Lex and Qi's symbol table (qi::symbol) could facilitate it, but I feel you could write that in any number of ways)
conversely, use regular expressions, as suggested before (below, and at least twice in mailing list).
Here's a simple idea:
(\d+) ([a-z]+).*?(\d+) \2
\d+ match a sequence of digits in a "(subexpression)" (NUM1)
([a-z]+) match a name (just picked a simple definition of 'name')
.*? skip any length of garbage, but as little as possible before starting subsequent match
\d+ match another number (sequence of digits) (NUM2)
\2 followed by the same name (backreference)
You can see how you'd already be narrowing the list of matches to inspect down to 'potential' hits. You'd only have to /post-validate/ to see that NUM2 == NUM1+2
Two notes:
Add (...)+ around the tail part to allow repeated matching of patterns
(\d+) ([a-z]+)(.*?(\d+) \2)+
You may wish to make the garbage skip (.*?) aware of separators (by doing negative zerowidth assertions) to avoid more than 2 skipping delimiters (e.g. s\d+=" as a delimiting pattern). I leave it out of scope for clarity now, here's the gist:
((?!s\d+=").)*? -- beware of potential performance degradation
Alec, The following is a show-case of how to do a wide range of things in Boost Spirit, in the context of answering your question.
I had to make assumptions about what is required input structure; I assumed
whitespace was strict (spaces as shown, no newlines)
the sequence numbers should be in increasing order
the sequence numbers should recur exactly in the text values
the keywords 'apple' and 'cheese' are in strict alternation
whether the keyword comes before or after the the sequence number in the text value, is also in strict alternation
Note There are about a dozen places in the implementation below, where significantly less complex choices could possibly have been made. For example, I could have hardcoded the whole pattern (as a de facto regex?), assuming that 4 items are always expected in the input. However I wanted to
make no more assumptions than necessary
learn from the experience. Especially the topic of qi::locals<> and inherited attributes have been on my agenda for a while.
However, the solution allows a great deal of flexibility:
the keywords aren't hardcoded, and you could e.g. easily make the parser accept both keywords at any sequence number
a comment shows how to generate a custom parsing exception when the sequence number is out of sync (not the expected number)
different spellings of the sequence numbers are currently accepted (i.e. s01="apple 001" is ok. Look at Unsigned Integer Parsers for info on how to tune that behaviour)
the output structure is either a vector<std::pair<int, std::string> > or a vector of struct:
struct Entry
{
int sequence;
std::string text;
};
both versions can be switched with the single #if 1/0 line
The sample uses Boost Spirit Qi for parsing.
Conversely, Boost Spirit Karma is used to display the result of parsing:
format((('s' << auto_ << "=\"" << auto_) << "\"") % ", ", parsed)
The output for the actual contents given in the post is:
parsed: s1="apple 1", s2="2 cheese", s3="apple 3", s4="4 cheese"
On to the code.
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/karma.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
namespace qi = boost::spirit::qi;
namespace karma = boost::spirit::karma;
namespace phx = boost::phoenix;
#if 1 // using fusion adapted struct
#include <boost/fusion/adapted/struct.hpp>
struct Entry
{
int sequence;
std::string text;
};
BOOST_FUSION_ADAPT_STRUCT(Entry, (int, sequence)(std::string, text));
#else // using boring std::pair
#include <boost/fusion/adapted/std_pair.hpp> // for karma output generation
typedef std::pair<int, std::string> Entry;
#endif
int main()
{
std::string input =
"s1=\"lxckvjlxcjvlkjlkje xvcjxzlvcj wqrej lxvcjz ljvl;x czvouzxvcu"
"j;ljfds apple 1 xcvljxclvjx oueroi xcvzlkjv; zjx\", s2=\"xzljlkxvc"
"jlkjxzvl jxcvljzx lvjlkj wre 2 cheese\", s3=\"apple 3\", s4=\"kxclvj"
"xcvjlxk jcvljxlck jxcvl 4 cheese\"";
using namespace qi;
typedef std::string::const_iterator It;
It f(input.begin()), l(input.end());
int next = 1;
qi::rule<It, std::string(int)> label;
qi::rule<It, std::string(int)> value;
qi::rule<It, int()> number;
qi::rule<It, Entry(), qi::locals<int> > assign;
label %= qi::raw [
( eps(qi::_r1 % 2) >> qi::string("apple ") > qi::uint_(qi::_r1) )
| qi::uint_(qi::_r1) > qi::string(" cheese")
];
value %= '"'
>> qi::omit[ *(~qi::char_('"') - label(qi::_r1)) ]
>> label(qi::_r1)
>> qi::omit[ *(~qi::char_('"')) ]
>> '"';
number %= qi::uint_(phx::ref(next)++) /*| eps [ phx::throw_(std::runtime_error("Sequence number out of sync")) ] */;
assign %= 's' > number[ qi::_a = _1 ] > '=' > value(qi::_a);
std::vector<Entry> parsed;
bool ok = false;
try
{
ok = parse(f, l, assign % ", ", parsed);
if (ok)
{
using namespace karma;
std::cout << "parsed:\t" << format((('s' << auto_ << "=\"" << auto_) << "\"") % ", ", parsed) << std::endl;
}
} catch(qi::expectation_failure<It>& e)
{
std::cerr << "Expectation failed: " << e.what() << " '" << std::string(e.first, e.last) << "'" << std::endl;
} catch(const std::exception& e)
{
std::cerr << e.what() << std::endl;
}
if (!ok || (f!=l))
std::cerr << "problem at: '" << std::string(f,l) << "'" << std::endl;
}
Provided you can use c++11 compiler, parsing these patterns is pretty simple using AXE†:
#include <axe.h>
#include <string>
template<class I>
void num_value(I i1, I i2)
{
unsigned n;
unsigned next = 1;
// rule to match unsigned decimal number and compare it with another number
auto num = axe::r_udecimal(n) & axe::r_bool([&](...){ return n == next; });
// rule to match a single word
auto word = axe::r_alphastr();
// rule to match space characters
auto space = axe::r_any(" \t\n");
// semantic action - print to cout and increment next
auto e_cout = axe::e_ref([&](I i1, I i2)
{
std::cout << std::string(i1, i2) << '\n';
++next;
});
// there are only two patterns in this example
auto pattern1 = (word & +space & num) >> e_cout;
auto pattern2 = (num & +space & word) >> e_cout;
auto s1 = axe::r_find(pattern1);
auto s2 = axe::r_find(pattern2);
auto text = s1 & s2 & s1 & s2 & axe::r_end();
text(i1, i2);
}
To parse the text simply call num_value(text.begin(), text.end()); No changes required to parse unicode strings.
† I didn't test it.
Look into Boost.Regex. I've seen an almost-identical poosting in boost-users and the solution is to use regexes for some of the match work.