Extract numbers from string (Regex C++) - c++

Let's say I have a string S = "1 this is a number=200; Val+54 4class find57"
I want to use regex to extract only these numbers:
num[1] = 1
num[2] = 200
num[3] = 54
and not the 4 in "4class" or the 57 in "find57", which means only numbers that are surrounded by operators or spaces.
I tried this code but got no results:
std::string str = "1 this is a number=200; Val+54 4class find57";
boost::regex re("(\\s|\\-|\\*|\\+|\\/|\\=|\\;|\n|$)([0-9]+)(\\s|\\-|\\*|\\+|\\/|\\;|\n|$)");
boost::sregex_iterator m1(str.begin(), str.end(), re);
boost::sregex_iterator m2;
for (; m1 != m2; ++m1) {
    advm1->Lines->Append((*m1)[1].str().c_str());
}
By the way, I'm using C++ Builder XE6.

Just use word boundaries. \b matches between a word character and a non-word character.
\b\d+\b
OR
\b[0-9]+\b
In a C++ string literal, escape each backslash once more: \\b\\d+\\b or \\b[0-9]+\\b

Related

How to find the exact substring with regex in c++11?

I am trying to find substrings that are not surrounded by other a-zA-Z0-9 symbols.
For example: I want to find the substring hello, so it won't match hello1 or hellow but will match Hello and heLLo!##$%.
Here is my sample:
std::string s = "1mySymbol1, /_mySymbol_ mysymbol";
const std::string sub = "mysymbol";
std::regex rgx("[^a-zA-Z0-9]*" + sub + "[^a-zA-Z0-9]*", std::regex::icase);
std::smatch match;
while (std::regex_search(s, match, rgx)) {
    std::cout << match.size() << "match: " << match[0] << '\n';
    s = match.suffix();
}
The result is:
1match: mySymbol
1match: , /_mySymbol_
1match: mysymbol
But I don't understand why the first occurrence, 1mySymbol1, also matches my regex.
How can I create a proper regex that ignores such strings?
UPD
If I do it like this:
std::string s = "mySymbol, /_mySymbol_ mysymbol";
const std::string sub = "mysymbol";
std::regex rgx("[^a-zA-Z0-9]+" + sub + "[^a-zA-Z0-9]+", std::regex::icase);
then I find only the substring in the middle:
1match: , /_mySymbol_
and I don't find the substrings at the beginning and at the end.
The regex [^a-zA-Z0-9]* will match 0 or more characters, so it's perfectly valid for [^a-zA-Z0-9]*mysymbol[^a-zA-Z0-9]* to match mysymbol in 1mySymbol1 (allowing for case insensitivity). As you saw, this is fixed when you use [^a-zA-Z0-9]+ (matching 1 or more characters) instead.
With your update, you see that this doesn't match strings at the beginning or end. That's because [^a-zA-Z0-9]+ has to match 1 or more characters (which don't exist at the beginning or end of the string).
You have a few options:
Use beginning/end anchors: (?:[^a-zA-Z0-9]+|^)mysymbol(?:[^a-zA-Z0-9]+|$) (non-alphanumeric OR beginning of string, followed by mysymbol, followed by non-alphanumeric OR end of string).
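Use negative lookahead and negative lookbehind: (?<![a-zA-Z0-9])mysymbol(?![a-zA-Z0-9]) (match mysymbol which doesn't have an alphanumeric character before or after it). Note that with this form the match won't include the characters before/after mysymbol. Also note that the default ECMAScript grammar of C++11 std::regex supports lookahead but not lookbehind, so this exact pattern needs an engine such as Boost.Regex or PCRE.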
I recommend using https://regex101.com/ to play around with regular expressions. It lists all the different constructs you can use.

Split string with specific constraint on delimiter

Suppose we have a string: "((0.2,0), (1.5,0)) A1 ABC p". I want to split it into logical units like this:
((0.2,0), (1.5,0))
A1
ABC
p
I.e. split the string by whitespace, with the requirement that the previous character isn't a comma.
Is it possible to solve this with a regex?
Update: I've tried in this way:
#include <iostream>
#include <string>
#include <regex>

int main()
{
    std::string s = "((0.2,0), (1.5,0)) A1 ABC p";
    std::regex re("[^, ]*\\(, *[^, ]*\\)*"); // as suggested in the updated answers
    std::sregex_token_iterator
        p(s.begin(), s.end(), re, -1);
    std::sregex_token_iterator end;
    while (p != end)
        std::cout << *p++ << std::endl;
}
The result was: ((0.2,0), (1.5,0)) A1 ABC p
Solution:
#include <iostream>
#include <string>
#include <regex>

int main() {
    std::string s = "((0.2,0), (1.5,0)) A1 ABC p";
    std::regex re("[^, ]*(, *[^, ]*)*");
    std::regex_token_iterator<std::string::iterator> p(s.begin(), s.end(), re);
    std::regex_token_iterator<std::string::iterator> end;
    while (p != end)
        std::cout << *p++ << std::endl;
}
Output:
((0.2,0), (1.5,0))
A1
ABC
p
You can do it like this:
[^, ]*(, *[^, ]*)*
What does this do?
First, let's go over the basics of regular expressions:
[] defines a set of characters that you want to match; for example, [ab] will match an 'a' or a 'b'.
The [^] syntax describes the characters you do NOT want to match, so [^ab] will match anything that is NOT an 'a' or a 'b'.
The * symbol tells the regular expression that the previous element can appear zero or more times, so a* will match the empty string '' or 'a' or 'aaa' or 'aaaaaaaaaaaaa'.
Putting () around part of an expression creates a group that you can then do interesting things with. In our case we used it to mark a part of the pattern as optional by putting * next to it, so that it can appear zero or more times.
OK, putting it all together:
The first part, [^, ]*, says: match zero or more characters that are NOT ' ' or ','. This will match strings like 'A1' or '((0.2'.
The second part, in ()*, is used to continue matching strings that have ',' and spaces in them but that you do not want to split. This part is optional, so that the pattern also correctly matches 'A1' or 'ABC' or 'p'.
So (, *[^, ]*)* will match zero or more strings that start with ',' and any number of ' ', followed by a string that does not have ',' or ' ' in it. In your example it would match ",0)", which is the continuation of "((0.2", then ", (1.5", and then ",0))", which all get added together to make "((0.2,0), (1.5,0))".
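NOTE: You may need to escape some characters in your expression depending on the regular expression library you are using. The solution works as written in this online tester: http://www.regexpal.com/
but tools that use POSIX basic syntax (grep or sed without -E, for example) need the grouping parentheses written as \( and \),
so there the expression would look like:
[^, ]*\(, *[^, ]*\)*
In the default ECMAScript grammar of std::regex, \( and \) match literal parentheses instead, so use the unescaped form there.
Also, I removed the ( |$) part, as it is only required if you want the trailing space to be part of the match.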

Qt C++ QRegExp parse string

I have the string str. I want to get two strings (one for '+' and one for '-'):
QString str = "+asdf+zxcv-tyupo+qwerty-yyuu oo+llad dd ff";
// I need these two strings:
// 1. For '+': asdf,zxcv,qwerty,llad dd ff
// 2. For '-': tyupo,yyuu oo
QRegExp rx("[\\+\\-](\\w+)");
int pos = 0;
while ((pos = rx.indexIn(str, pos)) != -1) {
    qDebug() << rx.cap(0);
    pos += rx.matchedLength();
}
Output I need:
"+asdf"
"+zxcv"
"-tyupo"
"+qwerty"
"-yyuu oo"
"+llad dd ff"
Output I get:
"+asdf"
"+zxcv"
"-tyupo"
"+qwerty"
"-yyuu"
"+llad"
If I replace \\w with .* the output is:
"+asdf+zxcv-tyupo+qwerty-yyuu oo+llad dd ff"
You can use the following regex:
[+-]([^-+]+)
The regex breakdown:
[+-] - either a + or -
([^-+]+) - a capturing group matching 1 or more symbols other than - and +.
Your regexp is excessive:
[\\+\\-](\\w+)
\______/\____/
   ^      ^--- one or more word characters (letters, digits, underscore)
   ^--- '+' or '-' sign
So what you are capturing is the +/- sign and any word that follows it directly. If you want to capture only the +/- signs, use [+-] as the regular expression.
EDIT:
To get the strings including the spaces, you need
QRegExp rx("[+-](\\w|\\s)+");

regex with all components optionals, how to avoid empty matches

I have to process a comma-separated string which contains triplets of values and translate them to runtime types. The input looks like:
"1x2y3z,80r160g255b,48h30m50s,1x3z,255b,1h,..."
So each substring should be transformed this way:
"1x2y3z" should become Vector3 with x = 1, y = 2, z = 3
"80r160g255b" should become Color with r = 80, g = 160, b = 255
"48h30m50s" should become Time with h = 48, m = 30, s = 50
The problem I'm facing is that all the components are optional (but they preserve order) so the following strings are also valid Vector3, Color and Time values:
"1x3z" Vector3 x = 1, y = 0, z = 3
"255b" Color r = 0, g = 0, b = 255
"1h" Time h = 1, m = 0, s = 0
What I have tried so far?
All components optional
((?:\d+A)?(?:\d+B)?(?:\d+C)?)
A, B and C are replaced with the correct letter for each case. The expression almost works, but it gives twice the expected results (one match for the string and another match for an empty string just after the first match), for example:
"1h1m1s" two matches [1]: "1h1m1s" [2]: ""
"11x50z" two matches [1]: "11x50z" [2]: ""
"11111h" two matches [1]: "11111h" [2]: ""
This isn't unexpected... after all an empty string matches the expression when ALL of the components are empty; so in order to fix this issue I've tried the following:
1 to 3 quantifier
((?:\d+[ABC]){1,3})
But now, the expression matches strings with wrong ordering or even repeated components!:
"1s1m1h" one match, should not match at all! (wrong order)
"11z50z" one match, should not match at all! (repeated components)
"1r1r1b" one match, should not match at all! (repeated components)
As for my last attempt, I've tried this variant of my first expression:
Match from begin ^ to the end $
^((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
It works better than the first version, but it still matches the empty string, and I would first have to tokenize the input and pass each token to the expression to ensure that the test string can match the begin (^) and end ($) anchors.
EDIT: Lookahead attempt (thanks to Casimir et Hippolyte)
After reading about and trying to understand the regex lookahead concept, and with the help of Casimir et Hippolyte's answer, I tried the suggested expression:
\b(?=[^,])(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Against the following test string:
"48h30m50s,1h,1h1m1s,11111h,1s1m1h,1h1h1h,1s,1m,1443s,adfank,12322134445688,48h"
And the results were amazing! It detects complete valid matches flawlessly (other expressions gave me 3 matches on "1s1m1h" or "1h1h1h", which weren't intended to match at all). Unfortunately it captures empty matches every time an invalid match is found, so a "" is detected just before "1s1m1h", "1h1h1h", "adfank" and "12322134445688". So I modified the lookahead condition to get the expression below:
\b(?=(?:\d+[ABC]){1,3})(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
It gets rid of the empty matches in any string which doesn't match (?:\d+[ABC]){1,3}, so the empty matches just before "adfank" and "12322134445688" are gone, but the ones just before "1s1m1h" and "1h1h1h" are still detected.
So the question is: is there any regular expression which matches triplet values in a given order, where every component is optional but at least one component must be present, and which doesn't match empty strings?
The regex tool I'm using is the C++11 one.
Yes, you can add a lookahead at the beginning to ensure there is at least one character:
^(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
If you need to find this kind of substring in a larger string (so without tokenizing first), you can remove the anchors and use a more explicit subpattern in a lookahead:
(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)
In this case, to avoid false positives (since you are looking for very small strings that can be part of something else), you can add word boundaries to the pattern:
\b(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Note: in a comma delimited string: (?=\d+[ABC]) can be replaced by (?=[^,])
I think this might do the trick.
I am keying on either the beginning of the string (^) or the comma separator (,) to fix the start of each match: (?:^|,).
Example:
#include <regex>
#include <string>
#include <iostream>

const std::regex r(R"~((?:^|,)((?:\d+[xrh])?(?:\d+[ygm])?(?:\d+[zbs])?))~");

int main()
{
    std::string test = "1x2y3z,80r160g255b,48h30m50s,1x3z,255b";
    std::sregex_iterator iter(test.begin(), test.end(), r);
    std::sregex_iterator end_iter;
    for(; iter != end_iter; ++iter)
        std::cout << iter->str(1) << '\n';
}
Output:
1x2y3z
80r160g255b
48h30m50s
1x3z
255b
Is that what you are after?
EDIT:
If you really want to go to town and make empty expressions unmatched then as far as I can tell you have to put in every permutation like this:
const std::string A = "(?:\\d+[xrh])";
const std::string B = "(?:\\d+[ygm])";
const std::string C = "(?:\\d+[zbs])";
const std::regex r("(?:^|,)(" + A + B + C + "|" + A + B + "|" + A + C + "|" + B + C + "|" + A + "|" + B + "|" + C + ")");

Regular expression that matches string equals to one in a group

For example, I want to match a string that has the same word at the end as at the beginning, so that the following strings match:
aaa dsfj gjroo gnfsdj riier aaa
sdf foiqjf skdfjqei adf sdf sdjfei sdf
rew123 jefqeoi03945 jq984rjfa;p94 ajefoj384 rew123
This one could do the job:
/^(\w+\b).*\b\1$/
explanation:
/ : regex delimiter
^ : start of string
( : start capture group 1
\w+ : one or more word characters
\b : word boundary
) : end of group 1
.* : any number of any char
\b : word boundary
\1 : group 1
$ : end of string
/ : regex delimiter
M42's answer is OK except for degenerate cases: it will not match a string with only one word. To accept those within one regexp, use:
/^(?:(\w+\b).*\b\1|\w+)$/
Also, matching only the necessary part may be significantly faster on very large strings. Here are my solutions in JavaScript:
RegExp:
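function areEdgeWordsTheSame(str) {
    var m = str.match(/^(\w+)\b/);
    if (!m) return false; // the string does not start with a word
    // \b keeps the first word from matching as a mere suffix (e.g. "aaa xaaa")
    return (new RegExp('\\b' + m[1] + '$')).test(str);
}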
String:
function areEdgeWordsTheSame(str) {
    var idx = str.indexOf(' ');
    if (idx < 0) return true;
    // compare the first word with everything after the last space
    return str.substr(0, idx) == str.substr(str.lastIndexOf(' ') + 1);
}
I don't think a regular expression is the right choice here. Why not split the line into an array and compare the first and the last item?
In C#:
string[] words = line.Split(' ');
return words.Length >= 2 && words[0] == words[words.Length - 1];