Understanding c++ regex by a simple example - c++

I wrote the following simple example:
#include <iostream>
#include <string>
#include <regex>
int main ()
{
std::string str("1231");
std::regex r("^(\\d)");
std::smatch m;
std::regex_search(str, m, r);
for(auto v: m) std::cout << v << std::endl;
}
and got confused by its behavior. If I understood the purpose of match_results from there correctly, only a single 1 should have been printed. Actually:
If successful, it is not empty and contains a series of sub_match
objects: the first sub_match element corresponds to the entire match,
and, if the regex expression contained sub-expressions to be matched ([...])
The string passed to the function doesn't match the regex, therefore we should not have had the entire match.
What did I miss?

You still get the entire match, but "entire match" does not mean the entire string; it means the part of the string that the entire regex matched.
For example consider this:
#include <iostream>
#include <string>
#include <regex>
int main()
{
std::string str("1231");
std::regex r("^(\\d)\\d"); // entire match will be 2 numbers
std::smatch m;
std::regex_search(str, m, r);
for(auto v: m)
std::cout << v << std::endl;
}
Output:
12
1
The entire match (first sub_match) is what the entire regex matches against (part of the string).
The second sub_match is the first (and only) capture group.
Looking at your original regex
std::regex r("^(\\d)");
              |----| <- entire expression (sub_match #0)
std::regex r("^(\\d)");
               |---| <- first capture group (sub_match #1)
That is where the two sub_matches come from.
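To check this for other patterns, a minimal sketch along these lines can list the text of every sub_match in index order. The helper name submatch_texts is hypothetical and not part of the original answer.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the text of every sub_match, index 0 first.
// Index 0 is the whole match; indices 1..n are the capture groups.
std::vector<std::string> submatch_texts(const std::string& input,
                                        const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch m;
    std::vector<std::string> out;
    if (std::regex_search(input, m, re)) {
        for (std::size_t i = 0; i < m.size(); ++i)
            out.push_back(m[i].str());
    }
    return out;  // empty when nothing matched
}
```

Calling submatch_texts("1231", "^(\\d)") should give two entries, both "1": one for the whole match and one for the capture group.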

From here
Returns whether **some** sub-sequence in the target sequence (the subject)
matches the regular expression rgx (the pattern). The target sequence is
either s or the character sequence between first and last, depending on
the version used.
So regex_search will search for anything in the input string that matches the regex. The whole string doesn't have to match, just part of it.
However, if you were to use regex_match, then the entire string must match.
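The difference can be sketched with two thin wrappers. The names matches_somewhere and matches_whole are made up for illustration.

```cpp
#include <regex>
#include <string>

// True if any sub-sequence of s matches re.
bool matches_somewhere(const std::string& s, const std::regex& re) {
    return std::regex_search(s, re);
}

// True only if the whole of s matches re.
bool matches_whole(const std::string& s, const std::regex& re) {
    return std::regex_match(s, re);
}
```

With the pattern "^(\\d)", the string "1231" satisfies the first function but not the second, while "1" satisfies both.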

Related

How to search and delete characters in string

I'm trying to convert .fsp files to strings, but the new .fsp file is very abnormal. It contains some undesirable characters that I want to delete from the string. How can I do that?
I have tried to search for the characters in the string and delete them, but I don't know how to do it.
The string looks like this:
string s;
s = 144˙037˙412˙864;
and I need to make it just like that
s = 144037412864;
So I expect a result like this:
string s = 144037412864;
Thank you for help.
We can use the remove-erase idiom to remove unnecessary characters from the string! There's a function in <algorithm> called remove_if. What remove_if does is move the elements that don't match some predicate to the front of the range, overwriting the ones that do match. remove_if returns an iterator pointing to the new logical end of the range; the string keeps its old size until erase is called. I'll show you how to write a function that does the job!
#include <algorithm>
#include <string>
void erase_ticks(std::string& s) {
// Returns true for characters that should be removed
auto condition = [](char c) { return c == '`'; };
// Removes characters that match the condition,
// and returns the new endpoint of the string
auto new_end = std::remove_if(s.begin(), s.end(), condition);
// Erases characters from the new endpoint to the current endpoint
s.erase(new_end, s.end());
}
We can use this in main, and it works just as expected!
#include <iostream>
int main() {
std::string s("123`456`789");
std::cout << s << '\n'; // prints 123`456`789
erase_ticks(s);
std::cout << s << '\n'; // prints 123456789
}
This problem has two parts. First, we need to identify any characters in the string which we don't want. From your use case it seems that anything that is not numeric needs to go. This is simple enough, as the standard library defines a function std::isdigit (simply add "#include <cctype>") which takes a character and returns a nonzero value if the character is a decimal digit. The argument should be converted to unsigned char first, because passing a negative char value is undefined behavior.
Second, we need a way to quickly and cleanly remove all occurrences of these from the string. For that we can use the 'Erase Remove' idiom.
std::string s = "123'4'5";
s.erase(std::remove_if(s.begin(), s.end(), [](unsigned char x) { return !std::isdigit(x); }), s.end());
In the snippet above we're calling erase on the string, which takes two iterators: the first refers to where we want to begin deleting and the second tells the call where to stop. The magic in this trick is all in the call to remove_if (add "#include <algorithm>" for it). remove_if works by shifting the elements (here, characters) that are kept toward the front of the string, overwriting the ones to be removed.
So for "123'4'5" the first five characters become "12345", and the two characters left over at the end have unspecified values. remove_if then returns an iterator to the start of that leftover tail, which is passed to erase to tell it to remove everything from there on. In the end we're left with "12345" as expected.
Edit: Forgot to mention, remove_if also takes a predicate; here I'm using a lambda which takes a character and returns a bool.
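As a rough sketch of the two steps kept separate, the hypothetical helper below (keep_digits is an invented name) reports how many characters remove_if kept before erase trims the tail. It assumes the input is a byte string such as UTF-8, where a character like ˙ occupies more than one byte and every one of those bytes is a non-digit.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: strips every non-digit byte from s in place and
// returns the number of characters that were kept.
std::size_t keep_digits(std::string& s) {
    // Step 1: move the digits to the front. s.size() is still unchanged here;
    // only the range [s.begin(), new_end) holds meaningful characters.
    auto new_end = std::remove_if(s.begin(), s.end(),
        [](unsigned char c) { return !std::isdigit(c); });
    const std::size_t kept = static_cast<std::size_t>(new_end - s.begin());

    // Step 2: drop the leftover tail so the size matches the kept characters.
    s.erase(new_end, s.end());
    return kept;
}
```

Because the predicate works byte by byte, this only suits cases like the one in the question, where everything that is not an ASCII digit should go.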

Regex matches under g++ 4.9 but fails under g++-5.3.1

I am tokenizing a string with a regex; this works normally under g++-4.9, but fails under g++-5.3.1.
I have the following txt file:
0001-SCAND ==> "Scandaroon" (from Philjumba)
0002-KINVIN ==> "King's Vineyard" (from Philjumba)
0003-HANNI ==> "Hannibal: Rome vs. Carthage" (from Philjumba)
0004-LOX ==> "Lords of Xidit" (from Philjumba)
which I am tokenizing using regular expressions, by spaces, quotation marks pairs and parentheses pairs. For example, the first line should be tokenized as follows:
0001-SCAND
==>
"Scandaroon"
(from Philjumba)
I have written the following std::regex:
std::regex FPAT("(\\S+)|(\"[^\"]*\")|(\\([^\\)]+\\))");
And I am tokenizing the string with:
std::vector<std::string>
split( const std::string & input, const std::regex & regex ) {
std::sregex_token_iterator
first{input.begin(), input.end(), regex, 0},
last;
return {first, last};
}
This returns the matches. Under g++-4.9 the string is tokenized as requested, but under g++-5.3.1 it's tokenized as follows:
0001-SCAND
==>
"Scandaroon"
(from
Philjumba)
or the third line is tokenized as follows:
0003-HANNI
==>
"Hannibal:
Rome
vs.
Carthage"
(from
Philjumba)
What could the issue be?
edit: I am calling the function as follows:
std::string line("0001-SCAND ==> \"Scandaroon\" (from Philjumba)");
auto elems = split( line, FPAT );
edit: following feedback from @xaxxon, I replaced returning the iterator by a vector, but it's still not working correctly under g++-5.3.
std::vector<std::string>
split( const std::string & input, const std::regex & regex ) {
std::sregex_token_iterator
first{input.begin(), input.end(), regex, 0},
last;
std::vector< std::string > elems;
elems.reserve( std::distance(first,last) );
for ( auto it = first; it != last; ++ it ) {
//std::cout << (*it) << std::endl;
elems.push_back( *it );
}
return elems;
}
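Alternation in the default (ECMAScript) grammar is ordered: the alternatives are tried from left to right and the first one that matches wins,
so for the regular expression "Set|SetValue" and the text "SetValue", the regex finds "Set".
You have to choose the order carefully:
std::regex FPAT(R"(("[^"]*")|(\([^)]+\))|(\S+))");
with \S+ at the end so that it is considered last.
Another alternative is to not use the default option (see http://en.cppreference.com/w/cpp/regex/syntax_option_type)
and use std::regex::extended, which follows the POSIX leftmost-longest rule. Note that \S is not part of the POSIX extended syntax, so the bracket expression [^[:space:]] is used instead:
std::regex FPAT(R"(([^[:space:]]+)|("[^"]*")|(\([^)]+\)))", std::regex::extended);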
So it seems that g++-5.3.1 has fixed a bug since g++-4.9 in this regard.
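A minimal sketch of the reordered pattern in use, assuming the default ECMAScript grammar. The function name split_fields is hypothetical, and the pattern is hard-coded here only to keep the example short.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: tokenizes a line into quoted strings, parenthesized
// groups, and whitespace-delimited words, tried in that order.
std::vector<std::string> split_fields(const std::string& input) {
    // Quoted and parenthesized alternatives come first; \S+ is the fallback.
    static const std::regex fpat(R"(("[^"]*")|(\([^)]+\))|(\S+))");
    std::sregex_token_iterator first(input.begin(), input.end(), fpat, 0);
    std::sregex_token_iterator last;
    return std::vector<std::string>(first, last);
}
```

For the first line of the sample file this should yield four tokens, with "Scandaroon" (quotes included) and (from Philjumba) each kept in one piece.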
You don't post enough for me to know for sure (you updated the question to show you are calling it with an lvalue, so this answer probably doesn't apply, but I'll leave it up unless people want me to take it down). If you're doing what I did, though, you forgot that the iterators point into the source string, and that string is no longer valid.
You could remove the const from input, but it's so damn convenient to be able to put an rvalue there, so.....
Here's what I do to avoid this: I return a unique_ptr to something that looks like the results, but I hide the actual source string along with it so the string can't go away before I'm done using it. This is likely UB, but I think it will work virtually all the time:
// Holds a regex match as well as the original source string so the matches remain valid as long as the
// caller holds on to this object - but it acts just like a std::smatch
struct MagicSmatch {
std::smatch match;
std::string data;
// constructor makes a copy of the string and associates
// the copy's lifetime with the iterators into the string (the smatch)
MagicSmatch(const std::string & data) : data(data)
{}
};
// this deleter knows about the hidden string and makes sure to delete it
// this cast is probably UB because std::smatch isn't a standard layout type
struct MagicSmatchDeleter {
void operator()(std::smatch * smatch) {
delete reinterpret_cast<MagicSmatch *>(smatch);
}
};
// the caller just thinks they're getting a smatch ptr.. but we know the secret
std::unique_ptr<std::smatch, MagicSmatchDeleter> regexer(const std::regex & regex, const std::string & source)
{
auto magic_smatch = new MagicSmatch(source);
std::regex_search(magic_smatch->data, magic_smatch->match, regex);
return std::unique_ptr<std::smatch, MagicSmatchDeleter>(reinterpret_cast<std::smatch *>(magic_smatch));
}
as long as you call it as auto results = regexer(....) then it's quite easy to use, though results is a pointer, not a proper smatch, so the [] syntax doesn't work as nicely.
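If the reinterpret_cast is a concern, one possible variation (not the approach used above) is to hand the caller the owning object itself. This is only a sketch; OwnedMatch and search_owned are invented names, and the caller has to write results->match[1] rather than treating the result as a plain smatch pointer.

```cpp
#include <memory>
#include <regex>
#include <string>
#include <utility>

// Hypothetical alternative: owns the string and the match together, so the
// iterators stored in `match` always point into `data`.
struct OwnedMatch {
    std::string data;   // declared first, so it exists before `match` is filled
    std::smatch match;
    bool found;

    OwnedMatch(const std::regex& re, std::string src)
        : data(std::move(src)), match(),
          found(std::regex_search(data, match, re)) {}

    // Copying or moving would leave `match` pointing into another string.
    OwnedMatch(const OwnedMatch&) = delete;
    OwnedMatch& operator=(const OwnedMatch&) = delete;
};

// Accepts lvalues and rvalues alike; the string is copied or moved into the
// heap-allocated object, which keeps it alive as long as the result is held.
std::unique_ptr<OwnedMatch> search_owned(const std::regex& re, std::string src) {
    return std::unique_ptr<OwnedMatch>(new OwnedMatch(re, std::move(src)));
}
```

The trade-off is a slightly less transparent interface in exchange for avoiding the cast between unrelated pointer types.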

Parse string into an unknown amount of regex groups in C++

I know the exact format of the text I should be getting. In particular, it should match a regex with a variable number of groups.
I want to use the C++ regex library to determine (a) if it is valid text, and (b) to parse those groups into a vector. How can I do this? I can find examples online to do (a), but not (b).
#include <string>
#include <regex>
#include <vector>
bool parse_this_text(std::string & text, std::vector<std::string> & group) {
// std::string text_regex = "^([a-z]*)(,[0-9]+)*$"
// if the text matches the regex, return true and parse each group into the vector
// else return false
???
}
Such that the following lines of code return the expected results.
std::vector<std::string> group;
parse_this_text("green,1", group);
// should return true with group = {"green", ",1"};
parse_this_text("yellow", group);
// should return true with group = {"yellow"};
parse_this_text("red,1,2,3", group);
// should return true with group = {"red", ",1", ",2", ",3"};
parse_this_text("blue,1.0,3.0,1,a", group);
// should return false (since it doesn't match the regex)
Thanks!
(?=^([a-zA-Z]*)(?:\,\d+)+$)^.*?(?:((?:\,\d+)+)).*?$
You can use this. It first validates the input using a lookahead and then returns 2 groups:
1) the name
2) all the remaining integers (this can easily be split, or you can use re.findall here)
Though it does not answer your question fully, it might be of help.
Have a look.
http://regex101.com/r/wE3dU7/3
One option is to scan the string twice, the first time to check for validity and the second time to split it into fields. With the example in the OP, you don't really need regexen to split the line, once you know that it is correct; you can simply split on commas. But for the sake of exposition, you could use a std::regex_token_iterator (assuming you have a C++ library which supports those), something like this:
bool parse_this_text(const std::string& s, std::vector<std::string>& result) {
static const std::regex check("[[:alpha:]][[:alnum:]]*(,[[:digit:]]+)*",
std::regex_constants::nosubs);
static const std::regex split(",");
if (!std::regex_match(s, check))
return false;
std::sregex_token_iterator tokens(s.begin(), s.end(), split, -1);
result.clear();
std::copy(tokens, std::sregex_token_iterator(), std::back_inserter(result));
return true;
}
For more complicated cases, or applications in which the double scan is undesired, you can tokenize using successive calls to std::regex_search(), supplying the end of the previous match as the starting point, and std::regex_constants::match_continuous as the match flags; that will anchor each search to the character after the previous match. You could, in that case, use a std::regex_iterator, but I'm not convinced that the resulting code is any simpler.
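A single-pass version based on that idea might look like the sketch below. It assumes the grammar from the question ([a-z]* followed by any number of ,[0-9]+ fields); the name parse_continuous and the two helper regexes are hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical single-pass parser: each search is anchored at `pos` via
// match_continuous, so a field that does not start exactly there is an error.
bool parse_continuous(const std::string& s, std::vector<std::string>& out) {
    static const std::regex name("[a-z]*");
    static const std::regex field(",[0-9]+");
    const auto anchored = std::regex_constants::match_continuous;

    out.clear();
    auto pos = s.cbegin();
    std::smatch m;

    // The leading name may be empty, so this search always succeeds.
    if (!std::regex_search(pos, s.cend(), m, name, anchored))
        return false;
    out.push_back(m.str());
    pos = m[0].second;  // continue right after the previous match

    while (pos != s.cend()) {
        if (!std::regex_search(pos, s.cend(), m, field, anchored)) {
            out.clear();
            return false;  // leftover text that is not a ",digits" field
        }
        out.push_back(m.str());
        pos = m[0].second;
    }
    return true;
}
```

For "red,1,2,3" this should fill the vector with "red", ",1", ",2", ",3", and for "blue,1.0,3.0,1,a" it should return false as soon as it reaches ".0".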

C++11 regex_token_iterator

Hmm... I thought I understood regexes, and I thought I understood iterators, but C++11's regex implementation has me puzzled...
One area I don't understand: Reading about regex token iterators, I came across the following sample code:
#include <fstream>
#include <iostream>
#include <algorithm>
#include <iterator>
#include <regex>
int main()
{
std::string text = "Quick brown fox.";
// tokenization (non-matched fragments)
// Note that regex is matched only two times: when the third value is obtained
// the iterator is a suffix iterator.
std::regex ws_re("\\s+"); // whitespace
std::copy( std::sregex_token_iterator(text.begin(), text.end(), ws_re, -1),
std::sregex_token_iterator(),
std::ostream_iterator<std::string>(std::cout, "\n"));
...
}
I don't understand how the following output:
Quick
brown
fox.
is being created by the std::copy() function above. I see no loop, so I am puzzled as to how the iteration is occurring. Or put another way, how is more than one line of output being generated?
std::copy copies elements from an input range into an output range. In your program, the input range is the three tokens extracted using the regular expression delimiter. These are the three words that are printed to the output. The output range is ostream_iterator which simply takes each element it is given and writes the element to an output stream.
If you step through std::copy using your debugger, you will see that it loops over the elements of the input range.
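Written out by hand, the loop hidden inside std::copy is roughly the following. This is a sketch that collects the tokens into a vector instead of printing them; the function name split_on_whitespace is made up.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: the same iteration std::copy performs, made explicit.
std::vector<std::string> split_on_whitespace(const std::string& text) {
    static const std::regex ws_re("\\s+");
    std::vector<std::string> tokens;

    // -1 selects the fragments between matches rather than the matches.
    std::sregex_token_iterator it(text.begin(), text.end(), ws_re, -1);
    const std::sregex_token_iterator end;  // default-constructed = end of sequence

    for (; it != end; ++it)        // ++it runs the next regex search
        tokens.push_back(it->str());
    return tokens;
}
```

Each increment of the iterator performs the next search, which is why three lines of output appear without a visible loop in the original code.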

boost::regex_search - boost kills my brain cells, again

Good programmers keep simple things easy right?
And it's not like the boost documentation makes your life less uneasy...
All I want is an implementation for:
// fulfils the function of a regex matching where the pattern may match a
// substring instead of the entire string
bool search( std::string, std::string, SomeResultType )
So it can be used as in:
std::string text, pattern;
SomeResultsType match;
if( search( text, pattern, match ) )
{
std::string result = match[0];
if( match[1].matched )
// where this is the second capture group, not recapturing the same group
std::string secondMatch = match[1];
}
I want my client code not to be bothered with templates and iterators... I know, I'm a wuss. After peering for an hour over the template spaghetti in the boost docs for doing something so simple, I feel like my productivity is seriously getting hampered and I don't feel like I've learned anything from it.
boost::regex_match does it pretty simply with boost::cmatch, except that it only matches the whole string, so I've been adapting all my patterns to match whole strings, but I feel that it is a dirty hack and would prefer a more proper solution. If I had known it would take this long, I would have stuck with regex_match.
Also welcome, a copy of Reading boost documentation for dummies
Next week in Keep it simple and easy with boost, function binders! No, just kidding, I wouldn't do that to anyone.
Thanks for all help
I think you want regex_search: http://www.boost.org/doc/libs/1_44_0/libs/regex/doc/html/boost_regex/ref/regex_search.html
Probably this overload is the one you want:
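template <class ST, class SA, class Allocator, class charT, class traits>
bool regex_search(const basic_string<charT, ST, SA>& s,
                  match_results<typename basic_string<charT, ST, SA>::const_iterator, Allocator>& m,
                  const basic_regex<charT, traits>& e,
                  match_flag_type flags = match_default);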
That seems to match what you wanted - SomeResultsType is smatch, and you need to convert your pattern to a regex first.
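A wrapper with the signature from the question could be sketched as follows. It is written against std::regex so that it needs no external library; Boost.Regex offers boost::regex, boost::smatch and boost::regex_search with the same shape, so swapping the namespace and header should be the only change, though that substitution is an assumption and has not been verified here.

```cpp
#include <regex>
#include <string>

// Sketch of the requested search() helper. The match object stores iterators
// into `text`, so `text` must outlive `match`; passing a temporary string
// would leave `match` dangling.
bool search(const std::string& text, const std::string& pattern,
            std::smatch& match) {
    const std::regex re(pattern);  // compiled on every call, for simplicity
    return std::regex_search(text, match, re);
}
```

After a successful call, match[0] is the whole match and match[1] is the first capture group, which can be checked with match[1].matched as in the question.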
On Windows, you can use the .NET Regex class:
Example (copied from the linked page):
#using <System.dll>
using namespace System;
using namespace System::Text::RegularExpressions;
int main()
{
// Define a regular expression for repeated words.
Regex^ rx = gcnew Regex( "\\b(?<word>\\w+)\\s+(\\k<word>)\\b",static_cast<RegexOptions>(RegexOptions::Compiled | RegexOptions::IgnoreCase) );
// Define a test string.
String^ text = "The the quick brown fox fox jumped over the lazy dog dog.";
// Find matches.
MatchCollection^ matches = rx->Matches( text );
// Report the number of matches found.
Console::WriteLine( "{0} matches found.", matches->Count );
// Report on each match.
for each (Match^ match in matches)
{
String^ word = match->Groups["word"]->Value;
int index = match->Index;
Console::WriteLine("{0} repeated at position {1}", word, index);
}
}