C++11 Regex search - Exclude empty submatches - regex

From the following text I want to extract the number and the unit of measurement.
I have 2 possible cases:
This is some text 14.56 kg and some other text
or
This is some text kg 14.56 and some other text
I used | to match both cases.
My problem is that it produces empty submatches, which gives me an incorrect number of matches.
This is my code:
std::smatch m;
std::string myString = "This is some text kg 14.56 and some other text";
const std::regex myRegex(
R"(([\d]{0,4}[\.,]*[\d]{1,6})\s+(kilograms?|kg|kilos?)|s+(kilograms?|kg|kilos?)(\s+[\d]{0,4}[\.,]*[\d]{1,6}))",
std::regex_constants::icase
);
if (std::regex_search(myString, m, myRegex)) {
    std::cout << "Size: " << m.size() << std::endl;
    for (std::size_t i = 0; i < m.size(); i++)
        std::cout << m[i].str() << std::endl;
}
else
    std::cout << "Not found!\n";
OUTPUT:
Size: 5
kg 14.56
kg
14.56
I want an easy way to extract those 2 values, so my guess is that I want the following output:
WANTED OUTPUT:
Size: 3
kg 14.56
kg
14.56
This way I can always directly extract the 2nd and 3rd submatches, but in this case I would also need to check which one is the number. I know how to do it with 2 separate searches, but I want to do it the right way, with a single search, without using C++ code to check whether a submatch is an empty string.

Using this regex, you just need the contents of Group 1 and Group 2
((?:kilograms?|kilos?|kg)|(?:\d{0,4}(?:\.\d{1,6})))\s*((?:kilograms?|kilos?|kg)|(?:\d{0,4}(?:\.\d{1,6})))
Explanation:
((?:kilograms?|kilos?|kg)|(?:\d{0,4}(?:\.\d{1,6})))
(?:kilograms?|kilos?|kg) - matches kilograms or kilogram or kilos or kilo or kg
| - OR
(?:\d{0,4}(?:\.\d{1,6})) - matches 0 to 4 digits followed by a dot and 1 to 6 digits of decimal part
\s* - matches 0+ whitespaces
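A minimal sketch of how this pattern might be used with std::regex. The names Quantity and find_quantity are hypothetical, and the icase flag is carried over from the question rather than from this answer. Since either group can hold the number, the sketch looks at the first character of Group 1 to decide which is which. Note that the pattern itself would also accept two units or two numbers in a row.

```cpp
#include <cctype>
#include <regex>
#include <string>

// Hypothetical result type: `found` is false when nothing matched.
struct Quantity {
    bool found;
    std::string number;
    std::string unit;
};

// Hypothetical helper around the answer's pattern (icase is an assumption
// taken from the question's code).
Quantity find_quantity(const std::string& text) {
    static const std::regex re(
        R"(((?:kilograms?|kilos?|kg)|(?:\d{0,4}(?:\.\d{1,6})))\s*((?:kilograms?|kilos?|kg)|(?:\d{0,4}(?:\.\d{1,6}))))",
        std::regex_constants::icase);
    std::smatch m;
    if (!std::regex_search(text, m, re)) {
        return {false, "", ""};
    }
    const std::string first = m[1].str();
    const std::string second = m[2].str();
    // Group 1 is the unit when it starts with a letter, otherwise the number.
    if (std::isalpha(static_cast<unsigned char>(first[0]))) {
        return {true, second, first};
    }
    return {true, first, second};
}
```

With the two inputs from the question, both calls should report the number 14.56 and the unit kg, and the match always has exactly three submatches.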

You can try this out:
((?:(?<!\d)(\d{1,4}(?:[\.,]\d{1,6})?)\s+((?:kilogram|kilos|kg)))|(?:((?:kilogram|kilos|kg))\s+(\d{1,4}(?:[\.,]\d{1,6})?)))
As shown here: https://regex101.com/r/9O99Fz/3
USAGE -
As I've shown in the 'substitution' section, to reference the numeral part of the quantity, you have to write $2$5, and for the unit, write: $3$4
Explanation -
The pattern has two alternatives, and only one of them takes part in a given match: the first one, (?:(?<!\d)(\d{1,4}(?:[\.,]\d{1,6})?)\s+((?:kilogram|kilos|kg))), matches the number followed by the unit,
and the other, (?:((?:kilogram|kilos|kg))\s+(\d{1,4}(?:[\.,]\d{1,6})?)), matches the unit followed by the number. Groups 2 and 3 belong to the first alternative, groups 4 and 5 to the second.
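A sketch of the same idea in C++, under two assumptions. First, std::regex's default ECMAScript grammar has no lookbehind, so (?<!\d) is replaced here by \b, which is close but not identical. Second, the icase flag is taken from the question. The names Measurement and extract_measurement are hypothetical. Concatenating m[2] with m[5] (and m[3] with m[4]) mirrors the $2$5 / $3$4 substitution, because a group that did not participate yields an empty string.

```cpp
#include <regex>
#include <string>

// Hypothetical result type: `found` is false when nothing matched.
struct Measurement {
    bool found;
    std::string number;
    std::string unit;
};

Measurement extract_measurement(const std::string& text) {
    // Same group layout as the answer's pattern; \b stands in for the
    // unsupported lookbehind (?<!\d) (an assumption of this sketch).
    static const std::regex re(
        R"(((?:\b(\d{1,4}(?:[.,]\d{1,6})?)\s+((?:kilogram|kilos|kg)))|(?:((?:kilogram|kilos|kg))\s+(\d{1,4}(?:[.,]\d{1,6})?))))",
        std::regex_constants::icase);
    std::smatch m;
    if (!std::regex_search(text, m, re)) {
        return {false, "", ""};
    }
    // Only one alternative matched, so one operand of each sum is empty.
    return {true, m[2].str() + m[5].str(), m[3].str() + m[4].str()};
}
```

This avoids any explicit check for empty submatches at the call site, at the cost of a fixed group numbering that has to be kept in sync with the pattern.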

Related

RegEx select all between two character

Example:
I want to extract everything between "Item:" until " * "
Item: *Sofa (1 SET), 2 × Mattress, 3 × Baby Mattress, 5
Seaters Car (Fabric)*
Total price: 100.00
Subtotal: 989.00
But I only managed to extract "Item: *" and " Seaters Car (Fabric)* " by using (.*?)\*
After matching Item:, match anything but a colon with [^:]+, and then lookahead for a newline, ensuring that the match ends at the end of a line just before another label (like Total price:) starts:
Item: ([^:]+)(?=\n)
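The question does not name a language, so the following is only a sketch in C++ with std::regex, whose default grammar supports the lookahead used here. The function name extract_item is hypothetical, and the sample text in the usage note writes the multiplication sign as a plain x.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text captured after "Item: ",
// or an empty string when the pattern does not match.
std::string extract_item(const std::string& text) {
    // [^:]+ runs up to the next colon, then backtracks until the match
    // ends right before a newline, i.e. at the end of the previous line.
    static const std::regex re(R"(Item: ([^:]+)(?=\n))");
    std::smatch m;
    return std::regex_search(text, m, re) ? m[1].str() : std::string();
}
```

For the sample input, the capture should span both lines of the item list and stop before the Total price line.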

c++11 (MSVS2012) regex looking for file names in multiple line std::string

I have been trying to search for a clear answer on this one, but not been able to find it.
So let's say I have the string (where \n could be \r\n - I want to handle both - not sure if that is relevant or not)
"4345t435\ng54t a_file_123.xml rk\ngreg a_file_j34.xml fger 43t54"
Then I want to get matches:
a_file_123.xml
a_file_j34.xml
Here is my test code:
const std::string s = "4345t435\ng54t a_file_123.xml rk\ngreg a_file_j34.xml fger 43t54";
std::smatch matches;
if (std::regex_search(s, matches, std::regex("a_file_(.*)\\.xml")))
{
std::cout << "total: " << matches.size() << std::endl;
for (unsigned int i = 0; i < matches.size(); i++)
{
std::cout << "match: " << matches[i] << std::endl;
}
}
Output is:
total: 2
match: a_file_123.xml
match: 123
I don't quite understand why match 2 is just "123"...
You only have one match, not two, as the regex_search method returns a single match. What you printed is two group values, Group 0 (the whole match, a_file_123.xml here) and Group 1 (the capturing group value, here, 123 that is a substring captured with a capturing group you defined as (.*) in the pattern).
If you want to match multiple strings, you need to use the regex iterator, not just a regex_search that only returns the first match.
Besides, .* is too greedy and will return weird results if you have more than 1 match on the same line. It seems you want to match letters or digits, so .* can be replaced with \w+. If there can really be anything, just use .*?.
Use
const std::string s = "4345t435\ng54t a_file_123.xml rk\ngreg a_file_j34.xml fger 43t54";
const std::regex rx("a_file_\\w+\\.xml");
std::vector<std::string> results(std::sregex_token_iterator(s.begin(), s.end(), rx),
std::sregex_token_iterator());
std::cout << "Number of matches: " << results.size() << std::endl;
for (auto result : results)
{
std::cout << result << std::endl;
}
See the C++ demo yielding
Number of matches: 2
a_file_123.xml
a_file_j34.xml
Notes on regex
a_file_ - a literal substring
\\w+ - 1+ word chars (letters, digits, _) (note you may use [^.]*? here instead of \\w+ if you want to match any char, 0 or more repetitions, as few as possible, up to the first .xml)
\\. - a dot (if you do not escape it, it will match any char except line break chars)
xml - a literal substring.
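If the captured part of each file name is needed as well (the 123 from the question's output), std::sregex_iterator gives access to the groups of every match. This is a sketch only. The function name capture_ids is hypothetical, and the pattern adds a capturing group around \w+.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects Group 1 (the text between "a_file_" and
// ".xml") for every match in `s`.
std::vector<std::string> capture_ids(const std::string& s) {
    static const std::regex rx(R"(a_file_(\w+)\.xml)");
    std::vector<std::string> ids;
    for (std::sregex_iterator it(s.begin(), s.end(), rx), end; it != end; ++it) {
        // (*it)[0] is the whole match, (*it)[1] the capturing group.
        ids.push_back((*it)[1].str());
    }
    return ids;
}
```

For the string from the question this should collect 123 and j34, one entry per match rather than one entry per group.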

C++ regex: Get index of the Capture Group the SubMatch matched to

Context. I'm developing a Lexer/Tokenizing engine, which would use regex as a backend. The lexer accepts rules, which define the token types/IDs, e.g.
<identifier> = "\\b\\w+\\b".
As I envision, to do the regex match-based tokenizing, all of the rules defined by regexes are enclosed in capturing groups, and all groups are separated by ORs.
When the matching is being executed, every match we produce must have an index of the capturing group it was matched to. We use these IDs to map the matches to token types.
So the problem of this question arises - how to get the ID of the group?
Similar question here, but it does not provide the solution to my specific problem.
Exactly my problem here, but it's in JS, and I need a C/C++ solution.
So let's say I've got a regex, made up of capturing groups separated by an OR:
(\\b[a-zA-Z]+\\b)|(\\b\\d+\\b)
which matches whole numbers or alpha-words.
My problem requires that the index of the capture group the regex submatch matched to could be known, e.g. when matching the string
foo bar 123
3 iterations will be done. The group indexes of the matches of every iteration would be 0 0 1, because the first two matches matched the first capturing group, and the last match matched the second capturing group.
I know that in standard std::regex library it's not entirely possible (regex_token_iterator is not a solution, because I don't need to skip any matches).
I don't have much knowledge about boost::regex or PCRE regex library.
What is the best way to accomplish this task? Which is the library and method to use?
You may use the sregex_iterator to get all matches, and once there is a match you may analyze the std::match_results structure and only grab the ID-1 value of the group that participated in the match (note only one group here will match, either the first one, or the second), which can be conveniently checked with the m[index].matched:
std::regex r(R"((\b[[:alpha:]]+\b)|(\b\d+\b))");
std::string s = "foo bar 123";
for (std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), r);
     i != std::sregex_iterator();
     ++i)
{
    std::smatch m = *i;
    std::cout << "Match value: " << m.str() << " at Position " << m.position() << '\n';
    // Group 0 is the whole match, so start at 1 (size_t avoids a signed/unsigned comparison).
    for (std::size_t index = 1; index < m.size(); ++index) {
        if (m[index].matched) {
            std::cout << "Capture group ID: " << index - 1 << std::endl;
            break;
        }
    }
}
See the C++ demo. Output:
Match value: foo at Position 0
Capture group ID: 0
Match value: bar at Position 4
Capture group ID: 0
Match value: 123 at Position 8
Capture group ID: 1
Note that R"(...)" is a raw string literal, no need to double backslashes inside it.
Also, index is set to 1 at the start of the for loop because the 0th group is the whole match, but you want group IDs to be zero-based, that is why 1 is subtracted later.
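For the lexer use case, the same check can be wrapped into a small tokenizer. The following is only a sketch. Token and tokenize are hypothetical names, and it assumes that the rule patterns contain no capturing groups of their own, since those would shift the group numbering.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical token type: `rule` is the zero-based index of the rule that matched.
struct Token {
    std::size_t rule;
    std::string text;
};

// Assumes each rule is a valid ECMAScript pattern without capturing groups.
std::vector<Token> tokenize(const std::string& input,
                            const std::vector<std::string>& rules) {
    // Wrap every rule in its own capturing group and join them with |.
    std::string combined;
    for (std::size_t i = 0; i < rules.size(); ++i) {
        if (i != 0) combined += '|';
        combined += '(' + rules[i] + ')';
    }
    const std::regex re(combined);

    std::vector<Token> tokens;
    for (std::sregex_iterator it(input.begin(), input.end(), re), end; it != end; ++it) {
        const std::smatch& m = *it;
        // Exactly one top-level group participates in each match.
        for (std::size_t g = 1; g < m.size(); ++g) {
            if (m[g].matched) {
                tokens.push_back({g - 1, m.str()});
                break;
            }
        }
    }
    return tokens;
}
```

Called with the two rules from the question on foo bar 123, this should produce three tokens with rule indexes 0, 0 and 1.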

Regex on numbers and spaces

I'm trying to match numbers surrounded by spaces, like this string:
" 1 2 3 "
I'm puzzled why the regex \s[0-9]\s matches 1 and 3 but not 2. Why does this happen?
Because the space has already been consumed:
\s[0-9]\s
This matches "spacedigitspace", so let's go through the process:
" 1 2 3 "
 ^
 |
 No match
" 1 2 3 "
  ^
  |
  No match
" 1 2 3 "
   ^
   |
   Matches, consume " 1 "
"2 3 "
 ^
 |
 No match
"2 3 "
  ^
  |
  No match
"2 3 "
   ^
   |
   No match
"2 3 "
    ^
    |
    Matches, consume " 3 "
You want a lookaround:
(?<=\s)\d(?=\s)
This is very different, as it looks for \d and then asserts that it is preceded by, and followed by, a space. This assertion is "zero width", which means that the spaces aren't consumed by the engine.
More precisely, the regex \s[0-9]\s does not match 2 only when you go through all matches in the string " 1 2 3 " one by one. If you were to try to start matching at positions 1 or 2, " 2 " would be matched.
The reason for this is that \s consumes part of the input - namely, the spaces around the digit. When you match " 1 ", the space between 1 and 2 is already taken; the regex engine is looking at the tail of the string, which is "2 3 ". At this point, there is no space in front of 2 that the engine could consume, so it goes straight to finding " 3 ".
To fix this, put spaces into zero-length look-arounds, like this:
(?<=\s)[0-9](?=\s)
Now the engine ensures that there are spaces in front and behind the digit without consuming these spaces as part of the match. This lets the engine treat the space between 1 and 2 as a space behind 1 and also as a space in front of 2, thus returning both matches.
The trailing space is consumed by the first match, so the subsequent match can't use it. You can use a lookahead to fix this:
\s+\d+(?=\s+)
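The difference can be checked by counting matches. This sketch uses C++ std::regex, which the question does not mention, so treat the language choice as an assumption. The helper name count_matches is hypothetical. The lookahead variant is used because std::regex's default ECMAScript grammar supports lookahead but not lookbehind.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts the successive (non-overlapping) matches
// of `pattern` in `text`.
std::ptrdiff_t count_matches(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    return std::distance(std::sregex_iterator(text.begin(), text.end(), re),
                         std::sregex_iterator());
}
```

On " 1 2 3 ", the consuming pattern \s[0-9]\s should give 2 matches, while \s+\d+(?=\s+) should give 3, because the trailing space is only asserted and stays available as the leading space of the next match.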
The expression \s[0-9]\s matches " 1 " and " 3 ". As the space after the 1 is matched, it can't also be used to match " 2 ".
You can use a positive lookbehind and a positive lookahead to match digits that are surrounded by spaces:
(?<= )(\d+)(?= )
Demo: https://regex101.com/r/hT1dT6/1

How to do a pattern match on a certain decimal pattern

I am trying to perform a pattern match in C++ where the format is...
###.######## (example input would be 135.123551235)
I have tried the following pattern but it won't match with the data I have inputted...
// get the points entered
getline(cin, x1ANDy1);
regex r("([0-9]+)\.([0-9]+)", regex_constants::basic);
if (regex_match(x1ANDy1, r))
{
cout << "Data has been entered properly.";
}
else
{
cout << "Data has been entered in the improper format, please re-enter your data.";
}
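This pattern would be \d{3}\.\d{9} for exactly 3 digits, a dot (.), and 9 digits, or \d{min,max}\.\d{min,max} (with min and max standing for the lower and upper bounds) if you want to allow a range of digit counts. Or replace the braces with * if you don't want to limit the count. Note that the dot has to be escaped, otherwise it matches any character; that in an ordinary C++ string literal every backslash must be doubled, e.g. "\\d{3}\\.\\d{9}"; and that regex_constants::basic should be dropped so that the default ECMAScript grammar, which supports \d and {n}, is used.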
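A minimal sketch of the check, assuming the exact "3 digits, dot, 9 digits" format and the default ECMAScript grammar. The function name is_valid_coordinate is hypothetical. A raw string literal is used so the backslashes do not need doubling.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when the whole input is 3 digits, a literal
// dot, and 9 digits (e.g. 135.123551235).
bool is_valid_coordinate(const std::string& input) {
    static const std::regex re(R"(\d{3}\.\d{9})");
    // regex_match requires the entire string to match, not just a part of it.
    return std::regex_match(input, re);
}
```

The result can replace the regex_match call in the question's if statement. Adjust the counts in the braces if the real format differs.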