BOOST Regex global Search behavior


My question is whether the Boost regex engine can do "global searches".
I've tried and I can't get it to work.
The match_results class contains the base pointer of the string, so after manually incrementing the
starting position and setting the match_flag_type to match_not_bob | match_prev_avail,
I would have thought the Boost regex engine would know it is in the middle of a string.
Since I'm using this engine in my software, I'd like to know whether it can in fact do this correctly and I'm doing something wrong, or whether global searching is not possible with this engine.
Below are sample code and output using Boost regex, and an equivalent Perl script.
Edit: Just to clarify, in the Boost example below the Start iterator is always treated as a boundary. The engine doesn't seem to consider text to the left of that position when making a match,
at least in this case.
7/22/2014 - The Solution for Global Search
Posting this update as the solution. It's not a workaround or kludge.
After googling 'regex_iterator' I knew that regex_iterator sees the text to the left of the
current search position, and I kept coming across the same source code. One site (like the others)
had a simple passing explanation of how it works, which said it calls 'regex_search()'
when the regex_iterator is incremented.
So, down in the bowels of the regex_iterator class, I saw that it does indeed call regex_search() when
the iterator is incremented (->Next()).
This 'regex_search()' overload isn't documented and comes in only one form.
It takes an extra bidirectional iterator (BIDI) parameter at the end named 'base'.
bool regex_search(BidiIterator first, BidiIterator last,
match_results<BidiIterator, Allocator>& m,
const basic_regex<charT, traits>& e,
match_flag_type flags,
BidiIterator base)
{
if(e.flags() & regex_constants::failbit)
return false;
re_detail::perl_matcher<BidiIterator, Allocator, traits> matcher(first, last, m, e, flags, base);
return matcher.find();
}
It appears that base is the wall to the left of the start BIDI: lookbehinds at the start position can look back as far as base to check their conditions.
So, I tested it out and it seemed to work.
The bottom line is to set the base BIDI to the start of the input, and put the start BIDI anywhere after it.
Effectively, this is like setting the pos() variable in Perl.
And, to emulate the global positional increment on a zero-length match, a simple conditional is all that's
needed:
Start = ( _M[0].length() == 0) ? _M[0].first + 1 : _M[0].second; (see below)
BOOST Regex 1.54 regex_search() using 'base' BIDI
Note - in this example every match is one character long, so Start always = _M[0].second;
The regex is purposely unlike the two other examples (below it), to demonstrate that
the text from 'Base' to 'Start' is in fact considered each time this regex is matched.
typedef std::string::const_iterator SITR;
boost::regex Rx( "(?<=(.)).", regex_constants::perl );
regex_constants::match_flag_type Flags = match_default;
string str("0123456789");
SITR Start = str.begin();
SITR End = str.end();
SITR Base = Start;
boost::smatch _M;
while ( boost::regex_search( Start, End, _M, Rx, Flags, Base) )
{
string str1(_M[1].first, _M[1].second );
string str0(_M[0].first, _M[0].second );
cout << str1 << str0 << endl;
// This line implements the Perl global match flag m//g ->
Start = ( _M[0].length() == 0) ? _M[0].first + 1 : _M[0].second;
}
output:
01
12
23
34
45
56
67
78
89
Perl 5.10
use strict;
use warnings;
my $str = "0123456789";
while ( $str =~ /(?<=(..))/g )
{
print ("$1\n");
}
output:
01
12
23
34
45
56
67
78
89
BOOST Regex 1.54 regex_search() no 'base'
string str("0123456789");
std::string::const_iterator Start = str.begin();
std::string::const_iterator End = str.end();
boost::regex Rx("(?<=(..))", regex_constants::perl);
regex_constants::match_flag_type Flags = match_default;
boost::smatch _M;
while ( boost::regex_search( Start, End, _M, Rx, Flags) )
{
string str(_M[1].first, _M[1].second );
cout << str << "\n";
Flags |= regex_constants::match_prev_avail;
Flags |= regex_constants::match_not_bob;
Start = _M[0].second;
}
output:
01
23
45
67
89

Updated in response to the comments:
#include <boost/regex.hpp>
#include <iostream>
#include <string>
int main()
{
using namespace boost;
std::string str("0123456789");
std::string::const_iterator start = str.begin();
std::string::const_iterator end = str.end();
boost::regex re("(?<=(..))", regex_constants::perl);
regex_constants::match_flag_type flags = match_default;
boost::smatch match;
while (start<end &&
boost::regex_search(start, end, match, re, flags))
{
std::cout << match[1] << "\n";
start += 1; // NOTE
//// some smartness that should work for most cases:
// start = (match.length(0)? match[0] : match.prefix()).first + 1;
flags |= regex_constants::match_prev_avail;
flags |= regex_constants::match_not_bob;
std::cout << "at '" << std::string(start,end) << "'\n";
}
}
Prints:
01 at '123456789'
12 at '23456789'
23 at '3456789'
34 at '456789'
45 at '56789'
56 at '6789'
67 at '789'
78 at '89'
89 at '9'

Related

c++ regex_replace not doing intended substitution

The following code is intended to convert the )9 in the first line to a )*9.
The original string is printed unmodified by the last line.
std::string ss ("1 + (3+2)9 - 2 ");
std::regex ee ("(\\)\\d)([^ ]");
std::string result;
std::regex_replace (std::back_inserter(result), ss.begin(), ss.end(), ee, ")*$2");
std::cout << result;
This is based on a very similar example at: http://www.cplusplus.com/reference/regex/regex_replace/
MS Visual Studio Express 2013.

I see two issues: first, your capture group should only include the '9' portion of the string, and second the group you want to use for replacement is not $2, but $1:
std::string ss ("1 + (3+2)9 - 2 ");
static const std::regex ee ("\\)(\\d)");
std::string result;
std::regex_replace (std::back_inserter(result), ss.begin(), ss.end(), ee, ")*$1");
std::cout << result;
Output:
1 + (3+2)*9 - 2
Edit
It appears that you want a more general replacement.
That is, wherever a number is followed by an open paren, e.g. 1(, or a close paren is followed by a number, e.g. )1, you want an asterisk between the number and the paren.
In C++ we can do this with regex_replace, but at the time of writing it takes two calls. We can chain them together:
std::string ss ("1 + 7(3+2)9 - 2");
static const std::regex ee ("\\)(\\d+)");
static const std::regex e2 ("(\\d+)\\(");
std::string result;
std::regex_replace (std::back_inserter(result), ss.begin(), ss.end(), ee, ")*$1");
result = std::regex_replace (result, e2, "$1*(");
std::cout << result;
Output:
1 + 7*(3+2)*9 - 2
Edit 2
Since you asked in another question how to turn this into one that can also capture spaces, here is a slight modification to handle possible spaces between the number and paren chars:
static const std::regex ee ("\\)\\s*(\\d+)");
static const std::regex e2 ("(\\d+)\\s*\\(");

C++ RegExp and placeholders

I'm on C++11 MSVC2013, I need to extract a number from a file name, for example:
string filename = "s 027.wav";
If I were writing code in Perl, Java or Basic, I would use a regular expression and something like this would do the trick in Perl5:
filename =~ /(\d+)/g;
and I would have the number "027" in placeholder variable $1.
Can I do this in C++ as well? Or can you suggest a different method to extract the number 027 from that string? Also, I should convert the resulting numerical string into an integral scalar, I think atoi() is what I need, right?

You can do this in C++, as of C++11 with the collection of classes found in regex. It's pretty similar to other regular expressions you've used in other languages. Here's a no-frills example of how you might search for the number in the filename you posted:
const std::string filename = "s 027.wav";
std::regex re = std::regex("[0-9]+");
std::smatch matches;
if (std::regex_search(filename, matches, re)) {
std::cout << matches.size() << " matches." << std::endl;
for (auto &match : matches) {
std::cout << match << std::endl;
}
}
As for converting 027 into a number, you could use atoi (from cstdlib) as you mentioned, but this will store the value 27, not 027. If you want to keep the leading 0, I believe you will need to keep it as a string. match above is a sub_match, so extract a string and convert it to a const char* for atoi:
int value = atoi(match.str().c_str());

Ok, I solved using std::regex which for some reason I couldn't get to work properly when trying to modify the examples I found around the web. It was simpler than I thought. This is the code I wrote:
#include <regex>
#include <string>
string FileName = "s 027.wav";
// The search object
smatch m;
// The regexp /\d+/ works in Perl and Java but for some reason didn't work here.
// With this other variation I look for exactly a string of 1 to 3 characters
// containing only numbers from 0 to 9
regex re("[0-9]{1,3}");
// Do the search
regex_search (FileName, m, re);
// 'm' works like an array: m[0] is the whole match, and m[1], m[2], ...
// hold the capture groups (like $1, $2, $3, etc. in Perl)
string sMidiNoteNum = m[0];
// This converts the string to an integer number
int MidiNote = atoi(sMidiNoteNum.c_str());

Here is an example using Boost; substitute the proper namespace and it should work.
// str is the std::string to search, e.g. the file name
typedef std::string::const_iterator SITR;
SITR start = str.begin();
SITR end = str.end();
boost::regex NumRx("\\d+");
boost::smatch m;
while ( boost::regex_search ( start, end, m, NumRx ) )
{
int val = atoi( m[0].str().c_str() );
start = m[0].second;
}

std::string search for numbers in a string & insert space before & after [duplicate]

I have this string:
string strInput = "33kfkdsfhk33 324234k334k 333 3 323434/545435436***33/rrrr34 e3mdgmflkgfdlglk3434424dfffff555555555555gggggg00000033lll-111111 1974-1-12";
I would like to format it as:
" 33 kfkdsfhk 33 324234 k 334 k 333 3 323434 / 545435436 * 33 /rrrr 34 e 3 mdgmflkgfdlglk 3434424 dfffff 555555555555 gggggg 00000033lll - 111111 1974 - 1 - 12 ";
That is, find a number and insert space before and after the number.
No Boost please... only standard C++ library.
This is what I tried. It inserts a space after a number, but I want to group all consecutive digits to get the desired output.
strInput = "33kfkdsfhk33 324234k334k 333 3 323434/545435436***33/rrrr34 e3mdgmflkgfdlglk3434424dfffff555555555555gggggg00000033lll-111111 1974-1-12";
for ( std::string::iterator it=strInput.begin(); it!=strInput.end(); ++it)
{
static bool flag = false;
if(isdigit(*it) && !flag)
{
strInput.insert(it,1,' ');
flag = true;
}
else
flag = false;
}

Your solution actually looks fairly good conceptually, but there is one major problem: After you insert into a string, all iterators pointing to it may be invalid, in particular your loop iterator it. That can lead to segfaults and all kinds of hard-to-explain bugs.
As an alternative solution, I would suggest not modifying the string you start with, but just reading from it and building a new one step by step, inserting spaces where you want them as you go along. This is really only a minor modification of your current code!
string strInput = ... // whatever;
string newString = "";
bool currentisdigit = false;
bool previouswasdigit = false;
for ( std::string::iterator it=strInput.begin(); it!=strInput.end(); ++it)
{
previouswasdigit = currentisdigit;
currentisdigit = isdigit(*it);
if(currentisdigit && !previouswasdigit)
newString.push_back(' ');
if(!currentisdigit && previouswasdigit)
newString.push_back(' ');
newString.push_back(*it);
}

How to extract a string that is present between two brackets?

For example if the string is:
XYZ ::[1][20 BB EC 45 40 C8 97 20 84 8B 10]
The output should be:
20 BB EC 45 40 C8 97 20 84 8B 10
int main()
{
char input = "XYZ ::[1][20 BB EC 45 40 C8 97 20 84 8B 10]";
char output[500];
// what to write here so that i can get the desired output as:
// output = "20 BB EC 45 40 C8 97 20 84 8B 10"
return 0;
}

In C, you could do this with a scanset conversion (though it's a bit RE-like, so the syntax gets a bit strange):
sscanf(input, "[%*[^]]][%[^]]]", second_string);
In case you're wondering how that works, the first [ matches an open bracket literally. Then you have a scanset, which looks like %[allowed_chars] or %[^not_allowed_chars]. In this case, you're scanning up to the first ], so it's %[^]]. In the first one, we have a * between the % and the rest of the conversion specification, which means sscanf will try to match that pattern, but ignore it -- not assign the result to anything. That's followed by a ] that gets matched literally.
Then we repeat essentially the same thing over again, but without the *, so the second data that's matched by this conversion gets assigned to second_string.
With the typo fixed and a bit of extra code added to skip over the initial XYZ ::, working (tested) code looks like this:
#include <stdio.h>
int main() {
char *input = "XYZ ::[1][20 BB EC 45 40 C8 97 20 84 8B 10]";
char second_string[64];
sscanf(input, "%*[^[][%*[^]]][%[^]]]", second_string);
printf("content: %s\n", second_string);
return 0;
}

Just find the second [ and start extracting (or just printing) until next ]....

You can use string::substr if you are willing to convert to std::string.
If you don't know the location of the brackets, you can use string::find_last_of to find the last close bracket, and then string::find_last_of again to find the open bracket.

Well, say, your file looks like this:
XYZ ::[1][20 BB EC 45 40 C8 97 20 84 8B 10]
XYZ ::[1][Maybe some other text]
XYZ ::[1][Some numbers maybe: 123 98345 123 9-834 ]
XYZ ::[1][blah-blah-blah]
The code that will extract the data will look something like this:
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main()
{
//opening the file to read from
std::ifstream file( "in.txt" );
if( !file.is_open() )
{
cout << "Cannot open the file";
return -1;
}
std::string in, out;
int blockNumber = 1;//Which bracket block we are looking for. We are currently looking for the second one.
while( getline( file, in ) )
{
int n = 0;//Variable for storing index in the string (where our target text starts)
int i = 0;//Counter for [] blocks we have encountered.
while( i <= blockNumber )
{
//What we are doing here is searching for the position of [ symbol, starting
//from the n + 1'st symbol of the string.
n = in.find_first_of('[', n + 1);
i++;
}
//Getting our data and printing it.
out = in.substr( n + 1, ( in.find_first_of(']', n) - n - 1) );
std::cout << out << std::endl;
}
return 0;
}
The output after executing this will be:
20 BB EC 45 40 C8 97 20 84 8B 10
Maybe some other text
Some numbers maybe: 123 98345 123 9-834
blah-blah-blah

The simplest solution is something along the lines of:
std::string
match( std::string const& input )
{
static boost::regex const matcher( ".*\\[[^]]*\\]\\[(.*)\\]" );
boost::smatch matched;
return regex_match( input, matched, matcher )
? matched[1]
: std::string();
}
The regular expression looks a bit complicated because you need to match
meta-characters, and because the compiler I use doesn't support raw
strings yet. (With raw strings, I think the expression would be
R"^(.*\[[^]]*\]\[(.*)\])^". But I can't verify that.)
This returns an empty string in case there is no match; if you're sure
about the format, you might prefer to throw an exception. You can also
extend it to do as much error checking as necessary: in general, the
more you validate a text input, the better it is, but you didn't give
precise enough information about what was legal for me to fill it out
completely. (For your example string, for example, you might replace
the ".*" at the beginning of the regular expression with
"\\u{3}\\s*::": three upper case characters followed by zero or more
whitespace, then two ':'. Or the first [] group might be
"\\[\\d\\]", if you're certain it's always a single digit.

This could work for you in a very specific sense:
std::string str(input);
std::string output(input.find_last_of('['), input.find_last_of(']'));
out = output.c_str();
The syntax isn't quite correct, so you will need to look that up. You should probably also define your question a little better, as this will only work if you want the bracketed string at the end.

Using the C string library. I'll give a code snippet that processes a single line, which can be used in a loop that reads the file line by line. NOTE: string.h should be included.
int length = strlen( input );
char* output = 0;
// Search
char* firstBr = strchr( input, '[' );
if( 0 != firstBr++ ) // check for null pointer
{
char* secondBr = strchr( firstBr, '[' );
// we don't need '['
if( 0 != secondBr++ )
{
int nOutLen = strlen( secondBr ) - 1;
if( 0 < nOutLen )
{
output = new char[nOutLen+1];
strncpy( output, secondBr, nOutLen );
output[ nOutLen ] = '\0';
}
}
}
if( 0 != output )
{
cout << output;
delete[] output;
output = 0;
}
else
{
cout << "Error!";
}

You could use this regex-like scanset with sscanf to get what is inside "<" and ">":
// Format: "<%999[^>]>" (max of 999 bytes, so dest needs room for at least 1000)
int n1 = sscanf(source, "<%999[^>]>", dest);

Regular expression library that returns all matches for multiple patterns in one run for C++?

I'm looking for a regular expression (or something else) library for C++ that would allow me to specify a number of patterns, run on a string and return the matching locations of all patterns.
For example:
Patterns {"abcd", "abce"}
String {"abcd abce abcd"}
Result:
abcd matches: 0-3, 10-13
abce matches: 5-8
Does anyone know of such a library?

I recommend boost::xpressive http://www.boost.org/doc/libs/1_39_0/doc/html/xpressive.html.
One possible solution:
string text = "abcd abce abcd";
static const sregex abcd = as_xpr("abcd"); // static - faster
sregex abce = sregex::compile( "abce" ); // compiled
sregex all = *(keep(abcd) | keep(abce));
smatch what;
// note: regex_match must consume the whole text, and this pattern has no
// branch for the spaces in the sample text, so extend it for real input
if( regex_match( text, what, all ) )
{
smatch::nested_results_type::const_iterator it = what.nested_results().begin();
smatch::nested_results_type::const_iterator end = what.nested_results().end();
for(; it != end; ++it)
{
if(it->regex_id() == abcd.regex_id())
{
// you matched abcd
// use it->begin() and it->end()
// or it->position() and it->length()
continue;
}
if(it->regex_id() == abce.regex_id())
{
// you matched abce...
continue;
}
}
}
I don't think this is the best solution; you could check "Semantic Actions and User-Defined Assertions" in the documentation.

Regular expressions are part of the TR1 standard extension and are implemented in a number of standard libraries (e.g. Dinkumware).
I think it's very straightforward to write the surrounding code yourself.

Doesn't it work with a simple "or"?
"abcd|abce"
which is a valid regular expression.
