Regex Matching optional numbers - c++

I have a text file that is currently parsed with a regular expression, and it's working well. The file format is well defined: 2 numbers, separated by any whitespace, followed by an optional comment.
Now, we need to add an additional (but optional) 3rd number to this file, making the format 2 or 3 numbers separated by whitespace, with an optional comment.
I've got a regex object that at least matches all the necessary line formats, but I am not having any luck with actually capturing the 3rd (optional) number even if it is present.
Code:
#include <iostream>
#include <regex>
#include <vector>
#include <string>
#include <cassert>
using namespace std;
bool regex_check(const std::string& in)
{
    std::regex check{
        "[[:space:]]*?" // eat leading spaces
        "([[:digit:]]+)" // capture 1st number
        "[[:space:]]*?" // eat second set of spaces
        "([[:digit:]]+)" // capture 2nd number
        "[[:space:]]*?" // eat more spaces
        "([[:digit:]]+|[[:space:]]*?)" // optionally, capture 3rd number
        "!*?" // Anything after '!' is a comment
        ".*?" // eat rest of line
    };
    std::smatch match;
    bool result = std::regex_match(in, match, check);
    for(auto m : match)
    {
        std::cout << " [" << m << "]\n";
    }
    return result;
}
int main()
{
    std::vector<std::string> to_check{
        " 12 3",
        " 1 2 ",
        " 12 3 !comment",
        " 1 2 !comment ",
        "\t1\t1",
        "\t 1\t 1\t !comment \t",
        " 16653 2 1",
        " 16654 2 1 ",
        " 16654 2 1 ! comment",
        "\t16654\t\t2\t 1\t ! comment\t\t",
    };
    for(auto s : to_check)
    {
        assert(regex_check(s));
    }
    return 0;
}
This gives the following output:
[ 12 3]
[12]
[3]
[]
[ 1 2 ]
[1]
[2]
[]
[ 12 3 !comment]
[12]
[3]
[]
[ 1 2 !comment ]
[1]
[2]
[]
[ 1 1]
[1]
[1]
[]
[ 1 1 !comment ]
[1]
[1]
[]
[ 16653 2 1]
[16653]
[2]
[]
[ 16654 2 1 ]
[16654]
[2]
[]
[ 16654 2 1 ! comment]
[16654]
[2]
[]
[ 16654 2 1 ! comment ]
[16654]
[2]
[]
As you can see, it's matching all of the expected input formats, but never is able to actually capture the 3rd number, even if it is present.
I'm currently testing this with GCC 5.1.1, but that actual target compiler will be GCC 4.8.2, using boost::regex instead of std::regex.

Let's do a step-by-step processing on the following example.
16653 2 1
^
^ is the currently matched offset. At this point, we're here in the pattern:
\s*?(\d+)\s*?(\d+)\s*?(\d+|\s*?)!*?.*?
^
(I've simplified [[:space:]] to \s and [[:digit:]] to \d for brevity.)
\s*? matches, and then (\d+) matches. We end up in the following state:
16653 2 1
     ^
\s*?(\d+)\s*?(\d+)\s*?(\d+|\s*?)!*?.*?
         ^
Same thing: \s*? matches, and then (\d+) matches. The state is:
16653 2 1
       ^
\s*?(\d+)\s*?(\d+)\s*?(\d+|\s*?)!*?.*?
                  ^
Now, things get trickier.
You have a \s*? here, a lazy quantifier. The engine tries to not match anything, and sees if the rest of the pattern will match. So it tries the alternation.
The first alternative is \d+, but it fails, since you don't have a digit at this position.
The second alternative is \s*?, and there are no other alternatives after that. It's lazy, so let's try to match the empty string first.
The next token is !*?, but it also matches the empty string, and it is then followed by .*?, which will match everything up to the end of the string (it does so because you're using regex_match - it would have matched the empty string with regex_search).
At this point, you've reached the end of the pattern successfully, and you got a match, without being forced to match \d+ against the string.
The thing is, this whole part of the pattern ends up being optional:
\s*?(\d+)\s*?(\d+)\s*?(\d+|\s*?)!*?.*?
                  \__________________/
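The difference between regex_match and regex_search mentioned above can be checked in isolation. The following is a minimal sketch, not part of the original code; the helper names lazy_with_match and lazy_with_search are made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: text matched by the lazy pattern ".*?" when the
// whole input has to match (std::regex_match).
std::string lazy_with_match(const std::string& in) {
    static const std::regex lazy(".*?");
    std::smatch m;
    return std::regex_match(in, m, lazy) ? m.str(0) : std::string();
}

// Hypothetical helper: text matched by the same pattern when any
// substring may match (std::regex_search).
std::string lazy_with_search(const std::string& in) {
    static const std::regex lazy(".*?");
    std::smatch m;
    return std::regex_search(in, m, lazy) ? m.str(0) : std::string();
}
```

With regex_match the lazy quantifier is forced to expand until the end of the input, while with regex_search it is satisfied with the empty string at the first position.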
So, what can you do? You can rewrite your pattern like so:
\s*?(\d+)\s+(\d+)(?:\s+(\d+))?\s*(?:!.*)?
Demo (with added anchors to mimic regex_match behavior)
This way, you're forcing the regex engine to consider \d and not get away with lazy-matching on the empty string. No need for lazy quantifiers since \s and \d are disjoint.
!*?.*? was also suboptimal, since !*? is already covered by the following .*?. I rewrote it as (?:!.*)? to require a ! at the start of a comment; if it's not there, the match will fail.
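As a rough illustration of the rewritten pattern with std::regex, here is a sketch that reports whether the optional 3rd number was captured. The ParsedLine struct and the parse_line function are hypothetical names introduced for this example only.

```cpp
#include <regex>
#include <string>

// Hypothetical result type for this sketch.
struct ParsedLine {
    bool ok = false;         // true when the line matched the format
    std::string first;       // 1st number
    std::string second;      // 2nd number
    bool has_third = false;  // true when the optional 3rd number is present
    std::string third;       // 3rd number, empty when has_third is false
};

// Parses "N N [N] [!comment]" using the rewritten pattern.
ParsedLine parse_line(const std::string& in) {
    static const std::regex pattern(
        R"(\s*(\d+))"       // leading spaces, 1st number
        R"(\s+(\d+))"       // mandatory spaces, 2nd number
        R"((?:\s+(\d+))?)"  // optional: spaces followed by a 3rd number
        R"(\s*(?:!.*)?)"    // trailing spaces and an optional '!' comment
    );
    ParsedLine out;
    std::smatch m;
    if (!std::regex_match(in, m, pattern)) return out;
    out.ok = true;
    out.first = m.str(1);
    out.second = m.str(2);
    out.has_third = m[3].matched;  // false when group 3 did not participate
    out.third = m.str(3);          // empty string in that case
    return out;
}
```

Checking m[3].matched rather than comparing against an empty string keeps "no 3rd number" distinct from any other empty capture.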

Related

Match every thing between "****" or [****]

I have a regex that look like this:
(?="(test)"\s*:\s*(".*?"|\[.*?]))
to match the value between "..." or [...]
Input
"test":"value0"
"test":["value1", "value2"]
Output
Group1 Group2
test value0
test "value1", "value2" // or - value1", "value2
Is there any trick to ignore "" and [] and stick with two groups, group1 and group2?
I tried (?="(test)"\s*:\s*(?="(.*?)"|\[(.*?)])) but this gives me 4 groups, which is not good for me.
You may use this regex in PHP (PCRE) with a branch reset group:
"(test)"\h*:\h*(?|"([^"]*)"|\[([^]]*)])
This will give you 2 capture groups for both inputs, whether the value is enclosed in "..." or [...].
RegEx Demo
RegEx Details:
(?|..) is a branch reset group. Subpatterns declared within each alternative of this construct start numbering from the same index.
(?|"([^"]*)"|\[([^]]*)]) is an alternation inside a branch reset group (not a conditional): "([^"]*)" is tried first, otherwise the \[([^]]*)] subpattern is tried, and in both cases the value is captured into Group 2.
You can use a pattern like
"(test)"\s*:\s*\K(?|"\K([^"]*)|\[\K([^]]*))
See the regex demo.
Details:
" - a " char
(test) - Group 1: test word
" - a " char
\s*:\s* - a colon enclosed with zero or more whitespaces
\K - match reset operator that clears the current overall match memory buffer (group value is still kept intact)
(?|"\K([^"]*)|\[\K([^]]*)) - a branch reset group:
"\K([^"]*) - matches a ", then discards it, and then captures into Group 2 zero or more chars other than "
| - or
\[\K([^]]*) - matches a [, then discards it, and then captures into Group 2 zero or more chars other than ]
In Java, you can't use \K or (?|...), so use separate capturing groups:
String s = "\"test\":[\"value1\", \"value2\"]";
Pattern pattern = Pattern.compile("\"(test)\"\\s*:\\s*(?:\"([^\"]*)|\\[([^\\]]*))");
Matcher matcher = pattern.matcher(s);
while (matcher.find()){
    System.out.println("Key: " + matcher.group(1));
    if (matcher.group(2) != null) {
        System.out.println("Value: " + matcher.group(2));
    } else {
        System.out.println("Value: " + matcher.group(3));
    }
}
See a Java demo.
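The same fallback should carry over to C++ std::regex, whose default ECMAScript grammar has neither \K nor branch reset groups. A minimal sketch, assuming only the key "test" is of interest; the function name extract_value is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the value for the key "test" without the
// surrounding quotes or brackets, or an empty string when nothing matches.
std::string extract_value(const std::string& in) {
    static const std::regex pattern(
        R"re("(test)"\s*:\s*(?:"([^"]*)"|\[([^\]]*)\]))re");
    std::smatch m;
    if (!std::regex_search(in, m, pattern)) return std::string();
    // Only one of the two alternatives takes part in a match, so pick
    // whichever group was actually set.
    return m[2].matched ? m.str(2) : m.str(3);
}
```

The caller still sees a single value, which is the practical effect the branch reset group provides in PCRE.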

Using RegEx to filter wrong Input?

Look at this example:
string str = "January 19934";
The Outcome should be
Jan 1993
I think I have created the right RegEx ([A-z]{3}).*([\d]{4}) to use in this case but I do not know what I should do now?
How can I extract what I am looking for, using RegEx? Is there a way like receiving 2 variables, the first one being the result of the first RegEx bracket: ([A-z]{3}) and the second result being the 2nd bracket: ([\d]{4})?
Your regex contains a common typo: [A-z] matches more than just ASCII letters. Also, the .* will grab all the string up to its end, and backtracking will force \d{4} to match the last 4 digits. You need to use a lazy quantifier with the dot, *?.
Then, use regex_search and concat the 2 group values:
#include <regex>
#include <sstream>
#include <string>
#include <iostream>
using namespace std;
int main() {
    regex r("([A-Za-z]{3}).*?([0-9]{4})");
    string s("January 19934");
    smatch match;
    std::stringstream res("");
    if (regex_search(s, match, r)) {
        res << match.str(1) << " " << match.str(2);
    }
    cout << res.str(); // => Jan 1993
    return 0;
}
See the C++ demo
Pattern explanation:
([A-Za-z]{3}) - Group 1: three ASCII letters
.*? - any 0+ chars other than line break symbols as few as possible
([0-9]{4}) - Group 2: 4 digits
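To see both points from this answer side by side (greedy versus lazy dot, and what [A-z] really accepts), here is a small sketch. The helper names month_and_year and a_to_z_accepts are invented for this example, and the [A-z] check assumes an ASCII execution character set with the default locale.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: applies `pattern` with std::regex_search and joins
// groups 1 and 2 with a space; returns an empty string when nothing matches.
std::string month_and_year(const std::string& in, const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(in, m, re)) return std::string();
    return m.str(1) + " " + m.str(2);
}

// Hypothetical helper: true when the character class [A-z] accepts the
// single character `c`.
bool a_to_z_accepts(char c) {
    static const std::regex re("[A-z]");
    return std::regex_match(std::string(1, c), re);
}
```

With the greedy .* the second group ends up holding the last four digits of "January 19934", while the lazy .*? stops at the first four. The range A-z also covers the punctuation that sits between Z and a in ASCII, such as [ and _.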
This could work.
([A-Za-z]{3})([a-z ])+([\d]{4})
Note the space after a-z is important to catch space.

How do you find 3 UNIQUE digits in a string of digits?

I am trying to write a regex that is very specific. I want to find 3 digits in a list. The issue comes because I do not care about repeating digits (5, 555, and 55555555555555 are seen as 5). Also, within the 3 digits, they need to be 3 different digits (123 = good, 311 = bad).
Here is what I have so far to find 3 digits, ignoring repeats but it does not specify 3 unique digits.
^(?:([0]{1,}|[1]{1,}|[2]{1,}|[3]{1,}|[4]{1,}|[5]{1,}|[6]{1,}|[7]{1,}|[8]{1,}|[9]{1,}|[0]{1,})(?!.*\\1)){3}$
Here is an example of the types of data I see.
Matching:
458
3333335555111
2222555111
222255558888
111147
9533333333
And not matching:
999999999
222252
888887
Right now my regex will find all of these. How can I ignore any that do not have 3 unique digits?
If your regex tool of choice supports lookaheads, back-references and possessive matching, you could use
^(\d)\1*+(?!.*\1)(\d)\2*+(\d)\3*+$
^ and $ are anchors to ensure, that we check the whole string
(\d) matches a digit into a first capturing group; with \1*+ we possessively match any following occurrences of this digit and use the negative lookahead (?!.*\1) to ensure that this digit does not appear again later in the string.
(\d)\2*+ then matches the next different digit, again matching any following occurrences possessively (check 122 without the possessive matching to see why I use it here)
(\d)\3*+ matches the last digit with any following occurrences.
Without possessive matching you could make more use of lookaheads, like ^(\d)\1*(?!.*\1)(\d)\2*(?!.*\2)(\d)\3*$
See https://regex101.com/r/pV2tB2/2 for a demo.
Side note: Regex might not be the best for this, but as you specifically asked for it - here you are.
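For C++ readers: std::regex with its default ECMAScript grammar has no possessive quantifiers, so a sketch there has to rely on the lookahead-only idea, with a plain greedy \3* at the end. The function name three_unique_runs is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when `in` consists of exactly three runs of
// repeated digits and the three digits are pairwise different,
// e.g. "2222555111".
bool three_unique_runs(const std::string& in) {
    static const std::regex pattern(
        R"(^(\d)\1*(?!.*\1)(\d)\2*(?!.*\2)(\d)\3*$)");
    return std::regex_match(in, pattern);
}
```

The lookahead after each run is what rejects inputs like 222252: once the run of 2s is cut short by backtracking, the same digit is still ahead and the assertion fails.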
This can be done with regex, but it's not the best tool for your work.
Instead of a regex-only approach, you can easily achieve this using Python.
Example:
strings = ['458', '3333335555111', '2222555111', '222255558888', '111147', '9533333333', '955555555', '12222211']
for s in strings:
    if len(set(list(s))) == 3:
        print("Ok :", s)
    else:
        print("Error :", s)
Output:
>> Ok : 458
>> Ok : 3333335555111
>> Ok : 2222555111
>> Ok : 222255558888
>> Ok : 111147
>> Ok : 9533333333
>> Error : 955555555
>> Error : 12222211
I've used the following commands while iterating over the strings inside that list:
list()
set()
len()
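The same set-based check can be sketched in C++ with std::set. The function name has_three_distinct is made up for this example; like the Python version, it only counts distinct characters and does not validate that they are digits.

```cpp
#include <set>
#include <string>

// Hypothetical helper: true when `in` contains exactly three distinct
// characters, regardless of their order or how often they repeat.
bool has_three_distinct(const std::string& in) {
    const std::set<char> distinct(in.begin(), in.end());
    return distinct.size() == 3;
}
```

Because position is ignored, an input such as 1213 passes this check even though its digits do not form three clean runs.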
Using negative lookahead, this should match any string of digits that contains at least 3 unique digits /^(\d)\1*(?!\1)(\d)(?:\2|\1)*(?!\2|\1)(\d)+$/
(\d) - Match a digit
\1* - Allow that digit to repeat
(?!\1) - Make sure that's followed by a digit that does not match the first match
(\d) - Match the new digit
(?:\2|\1)* - Allow repeats of either the first or second digit
(?!\2|\1) - Make sure that's followed by a digit that does not match the first or second match
(\d)+ - Capture the third unique digit, then allow any number of digits of any kind to follow
I'm not sure if an awk script will do it for you, but here it goes:
awk '
function match_func(num) {
    if (match_array[num] == 0)
        match_array[num] = 1;
}
{
    # reset the per-line state
    split("", match_array);
    match_sum = 0;
    # awk strings are indexed from 1
    for (i = 1; i <= length($1); i++)
        match_func(substr($1, i, 1));
    for (i = 0; i < 10; i++)
        if (match_array[i] == 1) match_sum++;
    if (match_sum == 3)
        print $1;
}'

R regular expression issue

I have a dataframe column including pages paths :
pagePath
/text/other_text/123-some_other_txet-4571/text.html
/text/other_text/another_txet/15-some_other_txet.html
/text/other_text/25189-some_other_txet/45112-text.html
/text/other_text/text/text/5418874-some_other_txet.html
/text/other_text/text/text/some_other_txet-4157/text.html
What I want to do is to extract the first number after a /, for example 123 from each row.
To solve this problem, I tried the following :
num = gsub("\\D"," ", mydata$pagePath) # to delete all characters other than digits
num1 = gsub("\\s+"," ",num) # to leave only one space between numbers
num2 = gsub("^\\s","",num1) # to delete the first space in my string
my_number = gsub( " .*$", "", num2 ) # to select the first number in my string
I thought that was what I wanted, but I had some trouble, especially with rows like the last row in the example: /text/other_text/text/text/some_other_txet-4157/text.html
So, what I really want is to extract the first number after a /.
Any help would be very welcome.
You can use the following regex with gsub:
"^(?:.*?/(\\d+))?.*$"
And replace with "\\1". See the regex demo.
Code:
> s <- c("/text/other_text/123-some_other_txet-4571/text.html", "/text/other_text/another_txet/15-some_other_txet.html", "/text/other_text/25189-some_other_txet/45112-text.html", "/text/other_text/text/text/5418874-some_other_txet.html", "/text/other_text/text/text/some_other_txet-4157/text.html")
> gsub("^(?:.*?/(\\d+))?.*$", "\\1", s, perl=T)
[1] "123" "15" "25189" "5418874" ""
The regex will match optionally (with a (?:.*?/(\\d+))? subpattern) a part of string from the beginning till the first / (with .*?/) followed with 1 or more digits (capturing the digits into Group 1, with (\\d+)) and then the rest of the string up to its end (with .*$).
NOTE that perl=T is required.
With stringr's str_extract, your code and pattern can be shortened to:
> str_extract(s, "(?<=/)\\d+")
[1] "123" "15" "25189" "5418874" NA
The str_extract will extract the first 1 or more digits if they are preceded with a / (the / itself is not returned as part of the match since it is a lookbehind subpattern, a zero width assertion, that does not put the matched text into the result).
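For comparison, the same extraction can be sketched in C++. std::regex (ECMAScript grammar) has no lookbehind, so this version matches the / and reads the digits from a capturing group instead; the function name first_number_after_slash is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first run of digits that directly
// follows a '/', or an empty string when the path has none.
std::string first_number_after_slash(const std::string& path) {
    static const std::regex pattern(R"(/(\d+))");
    std::smatch m;
    return std::regex_search(path, m, pattern) ? m.str(1) : std::string();
}
```

For the last sample row this yields an empty string, because 4157 follows a - rather than a /.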
Try this
\/(\d+).*
Demo
Output:
MATCH 1
1. [26-29] `123`
MATCH 2
1. [91-93] `15`
MATCH 3
1. [132-137] `25189`
MATCH 4
1. [197-204] `5418874`

Extract groups separated by space

I've got following string (example):
Loader[data-prop data-attr="value"]
There can be 1 to n attributes. I want to extract every attribute (data-prop, data-attr="value"). I tried it in many different ways, for example with \[(?:(\S+)\s)*\], but I didn't get it right. The expression should be written in PREG style.
I suggest grabbing all the key-value pairs with a regex:
'~(?:([^][]*)\b\[|(?!^)\G)\s*(\w+(?:-\w+)*(?:=(["\'])?[^\]]*?\3)?)~'
(see regex demo) and then
See IDEONE demo
$re = '~(?:([^][]*)\b\[|(?!^)\G)\s*(\w+(?:-\w+)*(?:=(["\'])?[^\]]*?\3)?)~';
$str = "Loader[data-prop data-attr=\"value\" more-here='data' and-one-more=\"\"]";
preg_match_all($re, $str, $matches);
$arr = array();
for ($i = 0; $i < count($matches); $i++) {
    if ($i != 0) {
        $arr = array_merge(array_filter($matches[$i]),$arr);
    }
}
print_r(preg_grep('~\A(?![\'"]\z)~', $arr));
Output:
Array
(
[3] => data-prop
[4] => data-attr="value"
[5] => more-here='data'
[6] => and-one-more=""
[7] => Loader
)
Notes on the regex (it only looks too complex):
(?:([^][]*)\b\[|(?!^)\G) - a boundary: we only start matching at a [ that is preceded with a word (a-zA-Z0-9_) character (with \b\[), or right after a successful match (with (?!^)\G). Also, ([^][]*) will capture into Group 1 the part before the [.
\s* - matches zero or more whitespace symbols
(\w+(?:-\w+)*) - captures into Group 2 "words" like "word1" or "word1-word2"..."word1-wordn"
(?:=(["\'])?[^\]]*?\3)? - optional group (due to (?:...)?) matching
= - an equal sign
(["\'])? - Group 3 (auxiliary group to check for the value delimiter) capturing either ", ' or nothing
[^\]]*? - (value) zero or more characters other than ] as few as possible
\3 - the closing ' or " (the same value captured in Group 3).
Since we cannot get rid of capturing ' or ", we can filter out the elements that we are not interested in with preg_grep('~\A(?![\'"]\z)~', $arr), where \A(?![\'"]\z) matches any string that is not equal to ' or ".
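Outside PCRE, where \G is not available (C++ std::regex, for instance), a two-step approach is one alternative. This is a different technique from the single regex above: first isolate the bracket content, then iterate over the attributes. The sketch assumes values are always quoted and contain no escaped quotes; the function name extract_attributes is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the attributes listed between the first
// '[' and the following ']'.
std::vector<std::string> extract_attributes(const std::string& in) {
    std::vector<std::string> attrs;

    // Step 1: isolate the text between the brackets.
    static const std::regex brackets(R"(\[([^\]]*)\])");
    std::smatch outer;
    if (!std::regex_search(in, outer, brackets)) return attrs;
    const std::string body = outer.str(1);

    // Step 2: a name such as data-attr, optionally followed by a quoted value.
    static const std::regex attr(
        R"re(\w+(?:-\w+)*(?:=(?:"[^"]*"|'[^']*'))?)re");
    for (std::sregex_iterator it(body.begin(), body.end(), attr), end;
         it != end; ++it) {
        attrs.push_back(it->str());
    }
    return attrs;
}
```

Splitting the work avoids the auxiliary quote group, so no post-filtering of stray ' or " entries is needed.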
How about something like [\s\[]([^\s\]]+(="[^"]+)*)+
gives
MATCH 1: data-prop
MATCH 2: data-attr="value"