boost regex to extract a number from string - regex

I have a string
resource = "/Music/1"
the string can take multiple numeric values after "/Music/" . I new to regular expression stuff . I tried following code
#include <iostream>
#include<boost/regex.hpp>
int main()
{
std::string resource = "/Music/123";
const char * pattern = "\\d+";
boost::regex re(pattern);
boost::sregex_iterator it(resource.begin(), resource.end(), re);
boost::sregex_iterator end;
for( ; it != end; ++it)
{
std::cout<< it->str() <<"\n";
}
return 0;
}
vickey#tb:~/trash/boost$ g++ idExtraction.cpp -lboost_regex
vickey#tb:~/trash/boost$ ./a.out
123
works fine . But even when the string happens to be something like "/Music23/123" it give me a value 23 before 123. When I use the pattern "/\d+" it would give results event when the string is /23/Music/123. What I want to do is extract the only number after "/Music/" .

I think part of the problem is that you haven't defined very well (at least to us) what it is you are trying to match. I'm going to take some guesses. Perhaps one will meet your needs.
The number at the end of your input string. For example "/a/b/34". Use regex "\\d+$".
A path element that is entirely numeric. For example "/a/b/12/c" or "/a/b/34" but not "/a/b56/d". Use regex "(?:^|/)(\\d+)(?:/|$)" and get captured group [1]. You might do the same thing with lookahead and lookbehind, perhaps with "(?<=^|/)\\d+(?=/|$)".

If there will never be anything after the last slash could you just use a regex or string.split() to get everything after the last slash. I'd get you code but I'm on my phone now.

Related

Quick regex_search/replace, or clear indication of replacement?

I must browse a collection of strings to replace a pattern and save the changes.
The saving operation is (very) expensive and out of my hands, so I would like to know beforehand if the replacement did anything.
I can use std::regex_search to gain knowledge on the pattern's presence in my input, and use capture groups to store details in a std::smatch. std::regex_replace does not seem to explicitely tell me wether it did anything.
The patterns and strings are arbitrarily long and complicated; running regex_replace after a regex_search seems wasteful.
I can directly compare the input and output to search for a discrepancy but that too is uncomfortable.
Is there either a simple way to observe regex_replace to determine its impact, or to use a smatch filled by the regex_search to do a faster replacement operation ?
Thanks in advance.
No regex_replace doesn't provide this info and yes you can do it with a regex_search loop.
For example like this:
std::regex pattern("...");
std::string replacement_format = "...";
std::string input = "......"; // a very, very long string
std::string output, replacement;
std::smatch match;
auto begin = input.cbegin();
int replacements = 0;
while (std::regex_search(begin, input.cend(), match, pattern)) {
output += match.prefix();
replacement = match.format(replacement_format);
if (match[0] != replacement) {
replacements++;
}
output += replacement;
begin = match.suffix().first;
}
output.append(begin, input.cend());
if (replacements > 0) {
// process output ...
}
Live demo
As regex_replace creates a copy of your string you could simply compare the replaced string with the original one and only "store" the new one if they differ.
For C++14 it seems that regex_replace returns a pointer to the last place it has written to:
https://www.cplusplus.com/reference/regex/regex_replace/ Versions 5
and 6 return an iterator that points to the element past the last
character written to the sequence pointed by out.

Using regex to parse out numbers

My problem is more or less self-explanatory, I want to write a regex to parse out numbers out of a string that user enters via console. I take the user input using:
getline(std::cin,stringName); //1 2 3 4 5
I asume that user enters N numbers followed by white spaces except the last number.
I have solved this problem by analyzing string char by char like this:
std::string helper = "";
std::for_each(stringName.cbegin(), strinName.cend(), [&](char c)
{
if (c == ' ')
{
intVector.push_back(std::stoi(helper.c_str()));
helper = "";
}
else
helper += c;
});
intVector.push_back(std::stoi(helper.c_str()));
I want to achieve the same behavior by using regex. I've wrote the following code:
std::regex rx1("([0-9]+ )");
std::sregex_iterator begin(stringName.begin(), stringName.end(), rx1);
std::sregex_iterator end;
while (begin != end)
{
std::smatch sm = *begin;
int number = std::stoi(sm.str(1));
std::cout << number << " ";
}
Problem with this regex occurs when it gets to the last number since it doesn't have space behind it, therefore it enters an infinite loop. Can someone give me an idea on how to fix this?
You're going to get an endless loop there because you never increment begin. If you do that, you'll get all the numbers except the last one (which, as you say, is not followed by a space).
But I don't understand why you feel it necessary to include the whitespace in the regular expression. If you just match a string of digits, the regex will automatically select the longest possible match, so the following character (if any) cannot be a digit.
I also see no value in the capture in the regex. If you wanted to restrict the capture to the number itself, you would have used ([0-9]+). (But since stoi only converts until it finds a non-digit, it doesn't matter.)
So you just use this:
std::regex rx1("[0-9]+");
for (auto it = std::sregex_iterator{str.begin(), str.end(), rx1},
end = std::sregex_iterator{};
it != end;
++it) {
std::cout << std::stoi(it->str(0)) << '\n';
}
(Live on coliru)

need support defining the right regex

I would like to parse a file using boost::sregex_token_iterator.
Unfortunately I'm not able to find the right regex to extract strings in the form FOO:BAR out of it.
The below code example is usable only if one such occurence per line is found, but I would like to support multiple of this entries per line, and ideally also a comment after an '#'
So entries like this
AA:BB CC:DD EE:FF #this is a comment
should result in 3 identified token (AA:BB, CC:DD, EE:FF)
boost::regex re("((\\W+:\\W+)\\S*)+");
boost::sregex_token_iterator i(line.begin(), line.end(), re, -1), end;
for(; i != end; i++){
std::stringstream ss(*i);
...
}
Any support is very welcome.
I suggest you use splitting to get the values you need.
I would begin by first splitting using #. This separates the comment from the rest of the line. Then split using white space, which separates the pairs out. After this, individual pairs can be split using :.
If, for whatever reason, you must use regex, you can iterate over the matches. In this case I would use the following regex:
(?:#(?:.*))*(\w+:\w+)\s*
This regex will match every pair until it finds a comment. If there is a comment, it will skip to the next new line.
You want to match sequences of 1 or more word chars followed with : and then having again 1 or more word chars.
Thus, you need to replace -1 with 1 in the call to boost::sregex_token_iterator to get Group 1 text chunks and replace the regex you use with \w+:\w+ pattern:
boost::regex re(R"(#.*|(\w+:\w+))");
boost::sregex_token_iterator i(line.begin(), line.end(), re, 1), end;
Note that R"(#.*|(\w+:\w+))" is a raw string literal that actually represents #.*|(\w+:\w+) pattern that matches # and then the rest of the line or matches and captures the pattern you need into Group 1.
See an std::regex C++ example (you may easily adjust the code for Boost):
#include <string>
#include <iostream>
#include <regex>
using namespace std;
int main() {
std::regex r(R"(#.*|(\w+:\w+))");
std::string s = "AA:BB CC:DD EE:FF #this is a comment XX:YY";
for(std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), r);
i != std::sregex_iterator();
++i)
{
std::smatch m = *i;
std::cout << m[1].str() << '\n';
}
return 0;
}

C++ boost::regex multiples captures

I'm trying to recover multiples substrings thanks to boost::regex and put each one in a var. Here my code :
unsigned int i = 0;
std::string string = "--perspective=45.0,1.33,0.1,1000";
std::string::const_iterator start = string.begin();
std::string::const_iterator end = string.end();
std::vector<std::string> matches;
boost::smatch what;
boost::regex const ex(R"(^-?\d*\.?\d+),(^-?\d*\.?\d+),(^-?\d*\.?\d+),(^-?\d*\.?\d+))");
string.resize(4);
while (boost::regex_search(start, end, what, ex)
{
std::string stest(what[1].first, what[1].second);
matches[i] = stest;
start = what[0].second;
++i;
}
I'm trying to extract each float of my string and put it in my vector variable matches. My result, at the moment, is that I can extract the first one (in my vector var, I can see "45" without double quotes) but the second one in my vector var is empty (matches[1] is "").
I can't figure out why and how to correct this. So my question is how to correct this ? Is my regex not correct ? My smatch incorrect ?
Firstly, ^ is symbol for the beginning of a line. Secondly, \ must be escaped. So you should fix each (^-?\d*\.?\d+) group to (-?\\d*\\.\\d+). (Probably, (-?\\d+(?:\\.\\d+)?) is better.)
Your regular expression searches for the number,number,number,number pattern, not for the each number. You add only the first substring to matches and ignore others. To fix this, you can replace your expression with (-?\\d*\\.\\d+) or just add all the matches stored in what to your matches vector:
while (boost::regex_search(start, end, what, ex))
{
for(int j = 1; j < what.size(); ++j)
{
std::string stest(what[j].first, what[j].second);
matches.push_back(stest);
}
start = what[0].second;
}
You are using ^ at several times in your regex. That's why it didn't match. ^ means the beginning of the string. Also you have an extra ) at the end of the regex. I don't know that closing bracket doing there.
Here is your regex after correction:
(-?\d*\.?\d+),(-?\d*\.?\d+),(-?\d*\.?\d+),(-?\d*\.?\d+)
A better version of your regex can be(only if you want to avoid matching numbers like .01, .1):
(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?)
A repeated search in combination with a regular expression that apparently is built to match all of the target string is pointless.
If you are searching repeatedly in a moving window delimited by a moving iterator and string.end() then you should reduce the pattern to something that matches a single fraction.
If you know that the number of fractions in your string is/must be constant, match once, not in a loop and extract the matched substrings from what.

Skipping first character regex

I'm currently using the following code to remove certain characters from a string in an array
myArray[x] = myArray[x].replaceAll("[aeiou]","");
which works fine, but I need to ignore the first character of the string so for example if an array element was Alan it would be stripped to Aln.
I'm not sure if using a replaceAll is the best way about doing it but the only other way I can think of is removing the first character, applying the above regex to the string, appending the character back on and then inserting back into the array, which seems a long winded way of doing it.
You can use a negative lookbehind to assert that the pattern is not preceded by the line start marker (^):
public static void main(String[] args) throws Exception {
final String[] input = {"abe", "bae"};
for(final String s: input) {
System.out.println(s.replaceAll("(?<!^)[aeiou]", ""));
}
}
Output:
ab
b
What about something like ..
myArray[x] = myArray[x].replaceAll("(^.)[aeiou]", "\\1");
// upd
Negative lookbehind is your solution, like Boris answered.