Avoid empty elements in match when optional substrings are not present - c++

I am trying to create a regex that matches the strings returned by the diff terminal command.
These strings start with a decimal number, optionally followed by a comma and another number, then a mandatory character (a, c, or d), then another mandatory decimal number, optionally followed by the same kind of comma-and-number group.
Examples:
27a27
27a27,30
28c28
28,30c29,31
1d1
1,10d1
I am trying to extract all the groups separately, with the optional ones captured without the comma.
I am doing this in C++:
#include<iostream>
#include<string>
#include<fstream>
#include <regex>
using namespace std;
int main(int argc, char* argv[])
{
string t = "47a46";
std::string result;
std::regex re2("(\\d+)(?:,(\\d+))?([acd])(\\d+)(?:,(\\d+))?");
std::smatch match;
std::regex_search(t, match, re2);
cout<<match.size()<<endl;
cout<<match.str(0)<<endl;
if (std::regex_search(t, match, re2))
{
for (int i=1; i<match.size(); i++)
{
result = match.str(i);
cout<<i<<":"<<result<< " ";
}
cout<<endl;
}
return 0;
}
The string variable t is the string I want to manipulate.
My regular expression
(\\d+)(?:,(\\d+))?([acd])(\\d+)(?:,(\\d+))?
is working, but with strings that do not have the optional subgroups (such as 47a46), the match variable will contain empty elements in the positions of the missing substrings.
For example in the program above the elements of match (preceded by their index) are:
1:47 2: 3:a 4:46 5:
Elements in positions 2 and 5 correspond to the optional substrings, which in this case are not present, so I would like match to skip them, giving:
1:47 2:a 3:46
How can I do it?

I think the best RE for you would be like this:
std::regex re2(R"((\d+)(?:,\d+)?([a-z])(\d+)(?:,\d+)?)");
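That way only the required groups are captured; the optional parts are still matched, but not captured. Note that [a-z] accepts any lowercase letter; use [acd] to restrict it to the three diff commands.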
output:
4
47a46
1:47 2:a 3:46
Note: the argument to re2 is written as a C++11 raw string literal.
EDIT: simplified RE a bit
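If the optional numbers are still needed when they are present, one alternative is to keep the original pattern and skip the groups that did not take part in the match. The following is a minimal sketch, assuming a hypothetical helper named presentGroups (not from the original post). It relies on the matched member of std::sub_match:

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns only the capture groups that took part in the match.
std::vector<std::string> presentGroups(const std::string& line) {
    // Same pattern as in the question: both ",N" parts are optional captures.
    static const std::regex re(R"((\d+)(?:,(\d+))?([acd])(\d+)(?:,(\d+))?)");
    std::vector<std::string> out;
    std::smatch m;
    if (std::regex_match(line, m, re)) {
        for (std::size_t i = 1; i < m.size(); ++i) {
            if (m[i].matched) {  // false for an optional group that was absent
                out.push_back(m[i].str());
            }
        }
    }
    return out;
}
```

With this sketch, presentGroups("47a46") yields 47, a, 46, while presentGroups("28,30c29,31") yields all five parts. The positions of the entries then depend on the input, so the caller has to locate the a/c/d element to tell the two ranges apart.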

Related

Regex to replace single occurrence of character in C++ with another character

I am trying to replace a single occurrence of a character '1' in a String with a different character.
This same character can occur multiple times in the String which I am not interested in.
For example, in the below string I want to replace the single occurrence of 1 with 2.
input:-0001011101
output:-0002011102
I tried the regex below, but it is giving me wrong results:
regex b1("(1){1}");
S1=regex_replace( S,
b1, "2");
Any help would be greatly appreciated.
If you used boost::regex, Boost regex library, you could simply use a lookaround-based solution like
(?<!1)1(?!1)
And then replace with 2.
With std::regex, you cannot use lookbehinds, but you can use a regex that captures either start of string or any one char other than your char, then matches your char, and then makes sure your char does not occur immediately on the right.
Then, you may replace with $01 backreference to Group 1 (the 0 is necessary since the $12 replacement pattern would be parsed as Group 12, an empty string here since there is no Group 12 in the match structure):
std::regex reg("([^1]|^)1(?!1)");
S1 = std::regex_replace(S, reg, "$012");
See the C++ demo online:
#include <iostream>
#include <regex>
int main() {
std::string S = "-0001011101";
std::regex reg("([^1]|^)1(?!1)");
std::cout << std::regex_replace(S, reg, "$012") << std::endl;
return 0;
}
// => -0002011102
Details:
([^1]|^) - Capturing group 1: any char other than 1 ([^...] is a negated character class) or start of string (^ is a start of string anchor)
1 - a 1 char
(?!1) - a negative lookahead that fails the match if there is a 1 char immediately to the right of the current location.
Use a negative lookahead in the regexp to match a 1 that isn't followed by another 1:
regex b1("1(?!1)");
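The two std::regex patterns above do not behave the same way on a run of 1s: the lookahead-only pattern checks just the right-hand side, so it also replaces the last 1 of a run. The sketch below puts both side by side; the helper names are hypothetical and chosen only for illustration:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces a '1' only when no '1' is directly before or after it.
std::string replaceIsolatedOnes(const std::string& s) {
    static const std::regex re("([^1]|^)1(?!1)");
    return std::regex_replace(s, re, "$012");  // "$01" = group 1, then a literal '2'
}

// Hypothetical helper: the lookahead-only variant, which checks only the right-hand side.
std::string replaceOnesNotFollowedByOne(const std::string& s) {
    static const std::regex re("1(?!1)");
    return std::regex_replace(s, re, "2");
}
```

For the input -0001011101, the first function gives -0002011102 (the requested output), while the second gives -0002011202 because the last 1 of the 111 run is not followed by another 1.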

C++: Matching regex, what is in smatch? [duplicate]

This question already has an answer here:
What is returned in std::smatch and how are you supposed to use it?
(1 answer)
Closed 2 years ago.
I'm using a modified regex example from Stroustrup C++ 4th Ed. Page 127 & 128. I'm trying to understand what is in the vector smatch matches.
$ ./a.out
AB00000-0000
AB00000-0000.-0000.
$ ./a.out
AB00000
AB00000..
It seems like the matches in parentheses () appear in match[1], match[2], ... while the total match appears in match[0].
Appreciate any insight into this.
#include <iostream>
#include <regex>
using namespace std;
int main(int argc, char *argv[])
{
// ZIP code pattern: XXddddd-dddd and variants
regex pat (R"(\w{2}\s*\d{5}(-\d{4})?)");
for (string line; getline(cin,line);) {
smatch matches; // matched strings go here
if (regex_search(line, matches, pat)) { // search for pat in line
for (auto p : matches) {
cout << p << ".";
}
}
cout << endl;
}
return 0;
}
The type of matches is a std::match_results, not a vector, but it does have an operator[].
From the reference:
If n == 0, returns a reference to the std::sub_match representing the part of the target sequence matched by the entire matched regular expression.
If n > 0 and n < size(), returns a reference to the std::sub_match representing the part of the target sequence that was matched by the nth captured marked subexpression.
where n is the argument to operator[]. So matches[0] contains the entire matched expression, and matches[1], matches[2], ... contain consecutive capture group expressions.
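To make the indexing concrete, here is a small sketch built on the pattern from the question. The helper name listSubMatches and the "<unmatched>" placeholder are assumptions for illustration. It also shows that an optional group which did not participate still occupies its index, with matched set to false:

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: lists every sub_match of the first match as text.
// Index 0 is the whole match; index 1 is the optional "-dddd" group.
std::vector<std::string> listSubMatches(const std::string& line) {
    static const std::regex pat(R"(\w{2}\s*\d{5}(-\d{4})?)");
    std::vector<std::string> out;
    std::smatch matches;
    if (std::regex_search(line, matches, pat)) {
        for (std::size_t i = 0; i < matches.size(); ++i) {
            // A group that did not participate has matched == false
            // and would otherwise convert to an empty string.
            out.push_back(matches[i].matched ? matches[i].str() : "<unmatched>");
        }
    }
    return out;
}
```

This is why the second run in the question prints AB00000.. with two dots: matches.size() is still 2, and the second element is simply empty.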

Getting a list of curly brace blocks using regex

I'm building a simple data encoder/decoder for a project I'm doing in c++, the data is written to a file in this format (dummy data):
{X143Y453CGRjGeBK}{X243Y6789CaRyGwBk}{X5743Y12CvRYGQBs}
The number of blocks is indefinite and the size of the blocks is variable.
To decode the image I need to iterate through each curly brace block and process the data within, the ideal output would look like this:
"X143Y453CGRjGeBK" "X243Y6789CaRyGwBk" "X5743Y12CvRYGQBs"
The closest I've got is:
"\\{(.*)\\}"
But this gives me the whole sequence rather than each block.
Sorry if this is a simple problem but regex hasn't really clicked with me yet, is this possible with regex or should I use a different method?
You can use [^{}]+:
[^{}]: Match a single character not present in the list below (in this case '{' & '}')
+: once that character is matched, match it between one and unlimited times, as many times as possible.
Testing: https://regex101.com/r/bNOK5U/1/
To extract multiple occurrences of substrings inside curly braces, that have no braces inside (that is, substrings inside innermost braces), you may use
#include <iostream>
#include <string>
#include <vector>
#include <regex>
int main() {
std::regex rx(R"(\{([^{}]*)})");
std::string s = "Text here {X143Y453CGRjGeBK} and here {X243Y6789CaRyGwBk}{X5743Y12CvRYGQBs} and more here.";
std::vector<std::string> results(std::sregex_token_iterator(s.begin(), s.end(), rx, 1),
std::sregex_token_iterator());
for( auto & p : results ) std::cout << p << std::endl;
return 0;
}
The std::regex rx(R"(\{([^{}]*)})") regex string is \{([^{}]*)}, and it matches
\{ - a { char
([^{}]*) - Capturing group 1: zero or more chars other than { and }
} - a } char.
The 1 argument passed to the std::sregex_token_iterator extracts just those values that are captured into Group 1.
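As for why the original \{(.*)\} returned the whole sequence: .* is greedy, so it runs to the last } in the input. The sketch below compares it with the negated class and with a lazy .*? (an alternative not mentioned above). The helper captureAll is a hypothetical name, and it uses std::sregex_iterator rather than the token iterator:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects capture group 1 of every match of `pattern` in `s`.
std::vector<std::string> captureAll(const std::string& s, const std::string& pattern) {
    const std::regex rx(pattern);
    std::vector<std::string> out;
    for (std::sregex_iterator it(s.begin(), s.end(), rx), end; it != end; ++it) {
        out.push_back((*it)[1].str());
    }
    return out;
}
```

On the input {a}{bc}{d}, the greedy pattern "\\{(.*)\\}" produces the single capture a}{bc}{d, whereas "\\{([^{}]*)\\}" and "\\{(.*?)\\}" each produce a, bc, and d.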

c++ regexp allowing digits separated by dot

I need a regexp allowing groups of up to two digits separated by dots, like 1.2 or 1.2.3 or 1.2.3.45 etc., but not 1234 or 1.234 etc. I'm trying "^[\d{1,2}.]+", but it allows all numbers. What's wrong?
You may try this:
^\d{1,2}(\.\d{1,2})+$
Regex 101 Demo
Explanation:
^ start of a string
\d{1,2} followed by one or two digits
( start of capture group
\.\d{1,2} followed by a dot and one or two digits
) end of capture group
+ indicates the previous capture group be repeated 1 or more times
$ end of string
Sample C++ source:
#include <regex>
#include <string>
#include <iostream>
using namespace std;
int main()
{
string regx = R"(^\d{1,2}(\.\d{1,2})+$)";
string input = "1.2.346";
smatch matches;
if (regex_search(input, matches, regex(regx)))
{
cout<<"match found";
}
else
cout<<"No match found";
return 0;
}
I think the last should not have more than 2 digits.
(\d{1,2}\.)+\d{1,2}(?=\b)
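The "What's wrong?" part of the question comes down to the square brackets: inside [...] the text \d{1,2}. is read as a set of single characters, not as a quantified digit followed by a dot. A small sketch of both behaviours follows; the helper names are hypothetical, and std::regex_match is used here instead of the ^...$ anchors:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `s` is groups of one or two digits separated by dots.
bool isDottedDigits(const std::string& s) {
    static const std::regex re(R"(\d{1,2}(?:\.\d{1,2})+)");
    return std::regex_match(s, re);  // regex_match must consume the whole string
}

// The pattern from the question, for comparison. Inside [...] the text
// \d{1,2}. is a set of single characters (digit, '{', '1', ',', '2', '}', '.'),
// not "one or two digits followed by a dot".
bool matchesOriginalAttempt(const std::string& s) {
    static const std::regex re(R"(^[\d{1,2}.]+)");
    return std::regex_search(s, re);
}
```

Here isDottedDigits accepts 1.2 and 1.2.3.45 and rejects 1234, 1.234, and 1.2.346, while matchesOriginalAttempt still accepts 1234.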

Ignore String containing special words (Months)

I am trying to find alphanumeric strings by using the following regular expression:
^(?=.*\d)(?=.*[a-zA-Z]).{3,90}$
Alphanumeric string: any string that contains at least one digit and one letter, plus any other special characters such as # - _ [] () {} ç _ \ ù %
I want to add an extra constraint to ignore all alphanumerical strings containing the following month formats :
JANVIER|FEVRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AOUT|SEPTEMBRE|OCTOBRE|NOVEMBRE|DECEMBRE|Jan|Feb|Mar|Apr|May|Jun|JUN|Jul|Aug|Sep|Oct|Nov|Dec|[jJ]anvier|[fF][ée]vrier|[mM]ars|[aA]vril|[mM]ai|[jJ]uin|[jJ]uillet|[aA]o[éû]t|aout|[sS]eptembre|[oO]ctobre|[nN]ovembre|[dD][eé]cembre
One solution is to actually match an alphanumerical string. Then check if this string contains one of these names by using the following function:
vector<string> findString(string s)
{
vector<string> vec;
boost::regex rgx("JANVIER|FEVRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AOUT|SEPTEMBRE|OCTOBRE|NOVEMBRE|DECEMBRE|Jan|Feb|Mar|Apr|May|Jun|JUN|Jul|Aug|Sep|Oct|Nov|Dec|[jJ]anvier|[fF][ée]vrier|[mM]ars|[aA]vril|[mM]ai|[jJ]uin|[jJ]uillet|[aA]o[éû]t|aout|[sS]eptembre|[oO]ctobre|[nN]ovembre|[dD][eé]cembre");
boost::smatch match;
boost::sregex_iterator begin {s.begin(), s.end(), rgx},
end {};
for (boost::sregex_iterator& i = begin; i != end; ++i)
{
boost::smatch m = *i;
vec.push_back(m.str());
}
return vec;
}
Question: How can I add this constraint directly into the regular expression instead of using this function.
One solution is to use negative lookahead as mentioned in How to ignore words in string using Regular Expressions.
I used it as follows:
String : 2-hello-001
Regular expression : ^(?=.*\d)(?=.*[a-zA-Z]^(?!Jan|Feb|Mar)).{3,90}$
Result: no match
Test website: http://regexlib.com/
The edit provided by @Robin and @RyanCarlson : ^[][\w#_(){}ç\\ù%-]{3,90}$ works perfectly in detecting alphanumeric strings with special characters. It's just the negative lookahead part that isn't working.
You can use negative look ahead, the same way you're using positive lookahead:
(?=.*\d)(?=.*[a-zA-Z])
(?!.*(?:JANVIER|FEVRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AOUT|SEPTEMBRE|OCTOBRE|NOVEMBRE|DECEMBRE|Jan|Feb|Mar|Apr|May|Jun|JUN|Jul|Aug|Sep|Oct|Nov|Dec|[jJ]anvier|[fF][ée]vrier|[mM]ars|[aA]vril|[mM]ai|[jJ]uin|[jJ]uillet|[aA]o[éû]t|aout|[sS]eptembre|[oO]ctobre|[nN]ovembre|[dD][eé]cembre)).{3,90}$
Also, your regex is pretty unclear. If you want alphanumerical strings with a length between 3 and 90, you can just do:
/^(?!.*(?:JANVIER|F[Eé]VRIER|MARS|AVRIL|MAI|JUIN|JUILLET|AO[Uù]T|SEPTEMBRE|OCTOBRE|NOVEMBRE|D[Eé]CEMBRE|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))
[][\w#_(){}ç\\ù%-]{3,90}$/i
The i flag means it will match upper and lower case (so you can reduce your forbidden list), and \w is a shortcut for [0-9a-zA-Z_]. Careful if you copy-paste: there's a line break here, for readability, between (?! ) and [ ]. Just add to the final [...] whatever special characters you want to match.
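For reference, here is a sketch of the same idea in C++ with std::regex, whose default ECMAScript grammar supports lookaheads. The helper name isAlnumWithoutMonth is hypothetical, and the month list is deliberately shortened to a few ASCII abbreviations: with a narrow-character std::regex, accented letters in UTF-8 are several bytes each and would probably not behave as single characters inside a [...] class.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `s` has a digit and a letter, is 3 to 90
// characters long, and contains none of the listed month abbreviations.
bool isAlnumWithoutMonth(const std::string& s) {
    // Shortened, ASCII-only month list (an assumption made for this sketch).
    static const std::regex re(
        R"(^(?=.*\d)(?=.*[a-zA-Z])(?!.*(?:jan|feb|mar)).{3,90}$)",
        std::regex::ECMAScript | std::regex::icase);
    return std::regex_match(s, re);
}
```

With the icase flag, the test string 2-hello-001 from the question is accepted, while 12-Jan-2020 and 2-MARS-01 are rejected because they contain a listed month prefix.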