Boost Regex unknown number of var - regex

I have an issue with a regex and need some help. I have some expressions like these in my .txt file:
19 = NAND (1, 19)
regex expression : http://rubular.com/r/U8rO09bvTO
With this regex I got separate matches for the numbers.
But now I need a regex that handles an unknown amount of numbers in the brackets.
For example:
19 = NAND (1, 23, 13, 24)
match1: 19, match2: 1, match3: 23, match4: 13, match5: 24
I don't know how many numbers there will be, so I need a general expression that handles anything from a minimum of 2 numbers in the brackets up to an unknown count.
By the way, I'm using C++.
@Martjin: Your first regex worked very well, thanks.
Here my code:
boost::cmatch result;
boost::regex matchNand ("([0-9]*) = NAND \\((.*?)\\)");
boost::regex matchNumb ("(\\d+)");
string cstring = "19 = NAND (1, 23, 13, 24)";
boost::regex_search(cstring.c_str(), result, matchNand);
cout << "NAND: " << result[1] << "=" << result[2] << endl;
string str = result[2];
boost::regex_search(str.c_str(), result, matchNumb);
cout << "NUM: " << result[1] << "," << result[2]<< "," << result[3] << "," << result[4] << endl;
My output:
NAND: 19=1, 23, 13, 24
NUM: 1,,,
So my new problem is that I only find the first number.
The result is also completely different from this solution: http://rubular.com/r/nqXDSuBXjc

A simple approach (and maybe clearer than a single regex) is to split this into two regexes.
First run a regex that splits your result from your arguments:
http://rubular.com/r/YkGdkkg4y3
([0-9]*) = NAND \((.*?)\)
Then perform a regex that will match all the numbers in your argument: http://rubular.com/r/2vpSbZvz12
\d+
Assuming you're using Ruby, you can perform a regex that matches multiple times with the function scan as explained here: http://ruby-doc.org/core-1.9.3/String.html#method-i-scan
Of course you could just use the second regex with the scan function to get all the numbers from that line, but I'm guessing you're going to expand it even more, which is when this approach will be a little more structured.
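Since the question is about C++ rather than Ruby, here is a minimal sketch of the same two-step idea. It assumes std::regex in place of Boost (boost::sregex_iterator offers the same interface), and parseNand is a hypothetical helper name. The important point is that regex_search only reports the first match, so the numbers have to be collected by iterating over all matches:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: parses a line such as "19 = NAND (1, 23, 13, 24)".
// Returns the output number followed by every input number, or an empty
// vector when the line does not have the expected shape.
std::vector<int> parseNand(const std::string& line) {
    static const std::regex gate(R"((\d+) = NAND \((.*?)\))");
    static const std::regex number(R"(\d+)");

    // Step 1: split the result from the argument list.
    std::smatch m;
    if (!std::regex_search(line, m, gate)) return {};

    std::vector<int> values{std::stoi(m[1].str())};

    // Step 2: walk over every number in the argument list.
    const std::string args = m[2].str();
    for (std::sregex_iterator it(args.begin(), args.end(), number), end; it != end; ++it) {
        values.push_back(std::stoi(it->str()));
    }
    return values;
}
```

The iterator plays the role that scan plays in Ruby: each step yields the next non-overlapping match, however many numbers the brackets contain.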

Related

C++11 Regex search - Exclude empty submatches

From the following text I want to extract the number and the unit of measurement.
I have 2 possible cases:
This is some text 14.56 kg and some other text
or
This is some text kg 14.56 and some other text
I used | to match both cases.
My problem is that it produces empty submatches, which gives me an incorrect number of matches.
This is my code:
std::smatch m;
std::string myString = "This is some text kg 14.56 and some other text";
const std::regex myRegex(
R"(([\d]{0,4}[\.,]*[\d]{1,6})\s+(kilograms?|kg|kilos?)|\s+(kilograms?|kg|kilos?)(\s+[\d]{0,4}[\.,]*[\d]{1,6}))",
std::regex_constants::icase
);
if( std::regex_search(myString, m, myRegex) ){
std::cout << "Size: " << m.size() << std::endl;
for(int i=0; i<m.size(); i++)
std::cout << m[i].str() << std::endl;
}
else
std::cout << "Not found!\n";
OUTPUT:
Size: 5
kg 14.56
kg
14.56
I want an easy way to extract those 2 values, so my guess is that I want the following output:
WANTED OUTPUT:
Size: 3
kg 14.56
kg
14.56
This way I can always directly extract the 2nd and 3rd, but in this case I would also need to check which one is the number. I know how to do it with 2 separate searches, but I want to do it the right way, with a single search and without using C++ to check whether a submatch is an empty string.
Using this regex, you just need the contents of Group 1 and Group 2
((?:kilograms?|kilos?|kg)|(?:\d{0,4}(?:\.\d{1,6})))\s*((?:kilograms?|kilos?|kg)|(?:\d{0,4}(?:\.\d{1,6})))
Explanation:
((?:kilograms?|kilos?|kg)|(?:\d{0,4}(?:\.\d{1,6})))
(?:kilograms?|kilos?|kg) - matches kilograms or kilogram or kilos or kilo or kg
| - OR
(?:\d{0,4}(?:\.\d{1,6})) - matches 0 to 4 digits followed by a dot and a decimal part of 1 to 6 digits
\s* - matches 0+ whitespaces
You can try this out:
((?:(?<!\d)(\d{1,4}(?:[\.,]\d{1,6})?)\s+((?:kilogram|kilos|kg)))|(?:((?:kilogram|kilos|kg))\s+(\d{1,4}(?:[\.,]\d{1,6})?)))
As shown here: https://regex101.com/r/9O99Fz/3
USAGE -
As I've shown in the 'substitution' section, to reference the numeral part of the quantity, you have to write $2$5, and for the unit, write: $3$4
Explanation -
There are two alternatives we could possibly need: the first one here (?:(?<!\d)(\d{1,4}(?:[\.,]\d{1,6})?)\s+((?:kilogram|kilos|kg))) is to match the number followed by the unit,
and the other (?:((?:kilogram|kilos|kg))\s+(\d{1,4}(?:[\.,]\d{1,6})?)) to match the unit followed by the number. Only one alternative takes part in any given match, so the groups of the other one are empty; that is why $2$5 and $3$4 each yield a single value.
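The same concatenation trick can be sketched in C++. This assumes std::regex with its default ECMAScript grammar, which has no lookbehind, so the (?<!\d) guard is dropped here (a number with more than four leading digits would therefore match on its last four). The outer wrapper group is omitted as well, so the groups are numbered 1 to 4, and extractWeight is a hypothetical name:

```cpp
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: returns {number, unit}, or two empty strings when
// nothing matches.
std::pair<std::string, std::string> extractWeight(const std::string& text) {
    static const std::regex re(
        R"((\d{1,4}(?:[.,]\d{1,6})?)\s+(kilograms?|kilos?|kg))"    // groups 1, 2
        R"(|(kilograms?|kilos?|kg)\s+(\d{1,4}(?:[.,]\d{1,6})?))",  // groups 3, 4
        std::regex_constants::icase);

    std::smatch m;
    if (!std::regex_search(text, m, re)) return {};

    // Groups from the alternative that did not match yield "", so joining
    // the two candidates for each role gives the value without any branching.
    return {m[1].str() + m[4].str(), m[2].str() + m[3].str()};
}
```

m.size() is still 5 because it counts the groups in the pattern, not the groups that matched, but the caller never has to look at the individual submatches.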

Regular expression to match input of n words separated by m spaces [duplicate]

This question already has answers here:
Regular expression capturing a repeated group
(1 answer)
c++ std::regex, smatch retains subexpressions only once for their apperance in a pattern string
(1 answer)
Closed 6 years ago.
So I'm learning regular expressions in C++11 and I'm trying to create a regular expression to match an input of N words separated by M spaces.
So, for example, you input " word word word word ..." and you can continue like this for how long you like.
Now my problems come when I try to access the fields in the smatch variable after comparing an input to the regular expression. At the moment what I have is:
#include <regex>
regex input_reg(
"(?:[[:space:]]*"
"([[:alpha:]_]+)"
"[[:space:]]*)+");
smatch comparison;
if (regex_match(input, comparison, input_reg)){
for (smatch::size_type i = 0; i < comparison.size(); ++i){
cout << i << ": '" << comparison.str(i) << "'" << endl;
}
}
The problem with this is that for some reason, I get a match as I should but when I try to cout all the fields to see if it works I only get the initial match and the first field, nothing else:
0: ' word word word word '
1: 'word'
What am I doing wrong?
EDIT: The input is as seen in cout example of my code, it doesn't show all the spaces in the text for some reason.
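As the titles of the linked duplicates suggest, a capture group that sits inside a repetition keeps only one value in the smatch, not one per repetition. A minimal sketch of a workaround, assuming it is acceptable to validate the whole input first and then collect the words in a second pass (splitWords is a hypothetical name, and the validation pattern is rewritten without the nested quantifier but accepts the same inputs):

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every word, or an empty vector if the input
// is not made up solely of words and whitespace.
std::vector<std::string> splitWords(const std::string& input) {
    // Validation only: no capture groups are needed here.
    static const std::regex whole(
        "[[:space:]]*[[:alpha:]_]+(?:[[:space:]]+[[:alpha:]_]+)*[[:space:]]*");
    // One word; iterated over the input below.
    static const std::regex word("[[:alpha:]_]+");

    if (!std::regex_match(input, whole)) return {};

    std::vector<std::string> words;
    for (std::sregex_iterator it(input.begin(), input.end(), word), end; it != end; ++it) {
        words.push_back(it->str());
    }
    return words;
}
```

Each step of the iterator is a separate match, so every word is reported regardless of how many there are.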

Getting the index of a slice

I want to do some processing on a string in Scala. The first stage of that is finding the index of articles such as: "A ", " A ", "a ", " a ". I am trying to do that like this:
"A house is in front of us".indexOfSlice("\\s+[Aa] ")
I think this should return 0, as the substring is first matched in the first position of the string.
However, this returns -1.
Why does it return -1? Is the regex I am using incorrect?
The other answers as I type this are just missing the point. Your problem is that indexOfSlice doesn't take a regexp, but a sub-sequence to search for in the sequence. So fixing the regexp won't help at all.
Try this:
val pattern = "\\b[Aa]\\b".r.unanchored
for (mo <- pattern.findAllMatchIn("A house is in front of us, a house is in front of us all")) {
println("pattern starts at " + mo.start)
}
//> pattern starts at 0
//| pattern starts at 27
(with fixed regex, too)
Edit: counter-example for the popular but wrong suggestion of "\\s*[Aa] "
val pattern2 = "\\s*[Aa] ".r.unanchored
for (mo <- pattern2.findAllMatchIn("The agenda is hidden")) {
println("pattern starts at " + mo.start)
}
//> pattern starts at 9
I see a mistake in your regex. Your regex is searching for
at least one space (\s+)
a letter (either A or a)
but the string you are matching doesn't contain a space at the beginning. That's why it returns -1 rather than index 0.
You could write your regex as "^\\s*[Aa] "
Here is an example:
import java.util.regex.Pattern
val text = "A house is in front of us";
val matcher = Pattern.compile("^\\s*[Aa] ").matcher(text)
var idx = 0;
if(matcher.find()){
idx = matcher.start()
}
println(idx)
it should return 0 as expected.

Regex help, match produces extra spaces

I am using the boost/regex.hpp library. The regex is intended to match a floating point number or one of an arbitrary list of math operators. The trailing a is a placeholder because the current code to construct the regex leaves a | at the end, and I haven't fixed it yet. My regex is:
(?:([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)|(\s*sqrt\((.+?)\)\s*)|(\s*exp\((.+?)\)\s*)|(\^)|(\s*log2\((.+?)\)\s*)|(\s*log10\((.+?)\)\s*)|(\s*neg\((.+?)\)\s*)|(\s*floor\((.+?)\)\s*)|(\s*log\((.+?)\)\s*)|(\s*fact\((.+?)\)\s*)|(/)|([*])|([+])|([-])|a)
and my test string is:
4.5 + 9.6e8 + sqrt(5)
The resulting match is:
4.5 + 9.6e8 + sqrt(5) 5
I'm not sure why there are so many spaces between the captures.
The printing code is
boost::regex reg(token);
boost::smatch m;
string s = input;
while (boost::regex_search(s, m, reg)) {
for (int i = 1; i < m.size(); ++i) cout << m[i] << " ";
s = m.suffix().str();
}
You have a lot of capturing parentheses and you are printing a space between each capture group. Many of your capture groups are empty. Maybe you want to refactor your regex to only capture what you really want.
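If refactoring the regex is not an option, another possibility is to skip the groups that did not take part in the match. The sketch below assumes std::regex in place of Boost (boost::sub_match also has a matched member) and uses a much shorter stand-in pattern; tokenize is a hypothetical name:

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects only the capture groups that took part in
// each match, so empty groups never reach the output.
std::vector<std::string> tokenize(const std::string& input) {
    // Simplified stand-in for the pattern above: a number, sqrt(...), or an operator.
    static const std::regex token(
        R"(([-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)|sqrt\((.+?)\)|([-+*/^]))");

    std::vector<std::string> tokens;
    for (std::sregex_iterator it(input.begin(), input.end(), token), end; it != end; ++it) {
        const std::smatch& m = *it;
        for (std::size_t i = 1; i < m.size(); ++i) {
            if (m[i].matched) tokens.push_back(m[i].str());
        }
    }
    return tokens;
}
```

Printing the returned tokens with a single space between them gives one separator per token instead of one per capture group.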

QRegExp not extracting text as expected

I am trying to extract text from between square brackets on a line of text. I've been messing with the regex for some time now, and cannot get what I need. (I can't even explain why the output is what it is). Here's the code:
QRegExp rx_timestamp("\[(.*?)\]");
int pos = rx_timestamp.indexIn(line);
if (pos > -1) {
qDebug() << "Captured texts: " << rx_timestamp.capturedTexts();
qDebug() << "timestamp cap: " <<rx_timestamp.cap(0);
qDebug() << "timestamp cap: " <<rx_timestamp.cap(1);
qDebug() << "timestamp cap: " <<rx_timestamp.cap(2);
} else qDebug() << "No indexin";
The input line is:
messages:[2013-10-08 09:13:41] NOTICE[2366] chan_sip.c: Registration from '"xx000 <sip:xx000#183.229.164.42:5060>' failed for '192.187.100.170' - No matching peer found
And the output is:
Captured texts: (".")
timestamp cap: "."
timestamp cap: ""
timestamp cap: ""
Can someone explain what is going on? Why is cap returning "." when no such character exists between square brackets?
Can someone correct the regex to extract the timestamp from between the square brackets?
You are missing two things. Escaping the backslash, and using setMinimal. See below.
QString line = "messages:[2013-10-08 09:13:41] NOTICE[2366] chan_sip.c: Registration from '\"xx000 <sip:xx000#183.229.164.42:5060>' failed for '192.187.100.170' - No matching peer found";
QRegExp rx_timestamp("\\[(.*)\\]");
rx_timestamp.setMinimal(true);
int pos = rx_timestamp.indexIn(line);
if (pos > -1) {
qDebug() << "Captured texts: " << rx_timestamp.capturedTexts();
qDebug() << "timestamp cap: " <<rx_timestamp.cap(0);
qDebug() << "timestamp cap: " <<rx_timestamp.cap(1);
qDebug() << "timestamp cap: " <<rx_timestamp.cap(2);
} else qDebug() << "No indexin";
Output:
Captured texts: ("[2013-10-08 09:13:41]", "2013-10-08 09:13:41")
timestamp cap: "[2013-10-08 09:13:41]"
timestamp cap: "2013-10-08 09:13:41"
timestamp cap: ""
UPDATE: What is going on:
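A backslash in a C++ string literal starts an escape sequence, such as \n. To get a backslash into a regular expression you have to escape the backslash itself, like so: \\ That way the regular expression engine sees \, which is what Ruby, Perl or Python would use. In your original code, \[ is not a valid C++ escape sequence; most compilers warn and drop the backslash, so QRegExp most likely received [(.*?)], a character class that matches a single (, ., *, ? or ). That would explain why the match was the lone "." in chan_sip.c and why there was no capture group.
The square brackets should be escaped, too, because in a regex they normally delimit a character class.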
So for the Regular expression engine to see a square bracket character you need to send it
\[
but a C++ source file can't get a \ character into an ordinary string literal without two of them in a row, so it turns into
\\[
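The two levels of escaping can be checked with a small sketch. It assumes std::regex rather than QRegExp, so the lazy .*? is written directly instead of calling setMinimal, and firstBracketed, kEscaped and kRaw are hypothetical names. A C++11 raw string literal avoids the doubling altogether:

```cpp
#include <regex>
#include <string>

// Ordinary literal: every backslash is doubled so the engine sees \[(.*?)\]
const std::string kEscaped = "\\[(.*?)\\]";
// Raw literal: the same pattern, written exactly as the engine sees it.
const std::string kRaw = R"(\[(.*?)\])";

// Hypothetical helper: returns the text inside the first [...] pair, or ""
// when the line contains none.
std::string firstBracketed(const std::string& line, const std::string& pattern) {
    std::smatch m;
    return std::regex_search(line, m, std::regex(pattern)) ? m[1].str() : "";
}
```

Both constants hold the same characters, so either one can be passed as the pattern.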
While learning regex, I liked using this regex tool by GSkinner. It has a listing on the right hand side of the page of unique codes and characters.
QRegExp doesn't follow common regex syntax exactly. If you study the documentation you will find a lot of little differences, such as how it does greedy vs. lazy matching.
QRegExp and double-quoted text for QSyntaxHighlighter
How the captures are listed is pretty typical as far as I have seen from regex parsers. The capture listing first gives the entire match, then the first capture group (what was enclosed by the first set of parentheses), and so on.
http://qt-project.org/doc/qt-5.0/qtcore/qregexp.html#cap
http://qt-project.org/doc/qt-5.0/qtcore/qregexp.html#capturedTexts
To find more matches, you have to iteratively call indexIn.
http://qt-project.org/doc/qt-5.0/qtcore/qregexp.html#indexIn
QString str = "offsets: 1.23 .50 71.00 6.00";
QRegExp rx("\\d*\\.\\d+"); // primitive floating point matching
int count = 0;
int pos = 0;
while ((pos = rx.indexIn(str, pos)) != -1) {
++count;
pos += rx.matchedLength();
}
// pos will be 9, 14, 18 and finally 24; count will end up as 4
Hope that helps.