Find regex matches & remove outer part of the match - regex

I have a string
content = "std::cout << func(some_val) << std::endl; auto i = func(some_other_val);"
and I want to find all instances of func(...) and remove the function call, so that I would get
content = "std::cout << some_val << std::endl; auto i = some_other_val;"
So I've tried this:
import re
content = "std::cout << func(some_val) << std::endl; auto i = func(some_other_val);"
c = re.compile(r'func\([a-zA-Z0-9_]+\)')
print(c.sub('', content)) # gives "std::cout << << std::endl; auto i = ;"
but this removes the entire match, not just the func( and ).
Basically, how do I keep whatever matched with [a-zA-Z0-9_]+?

You can use re.sub to replace every func(...) call with only the value inside it, as shown below. Here I've used [\w]+; adjust the character class if your arguments can contain other characters.
import re
regex = r"func\(([\w]+)\)"
test_str = "std::cout << func(some_val) << std::endl; auto i = func(some_other_val);"
subst = "\\1"
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
if result:
    print(result)
Demo: https://rextester.com/QZJLF65281
Output:
std::cout << some_val << std::endl; auto i = some_other_val;

You should capture the part of the match that you want to keep into a group:
re.compile(r'func\(([a-zA-Z0-9_]+)\)')
Here I captured it into group 1.
And then you can refer to group 1 with \1:
print(c.sub(r'\1', content))
Note that, in general, you should not use regex to parse source code of a non-regular language (such as C in this case). It might work in a few very specific cases where the input is very limited, but you should still use a C parser to parse C code. There are libraries available for that.
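Putting the two pieces of this answer together, a minimal runnable sketch might look like the following. The helper name strip_func_calls is only illustrative, and the second call shows the limitation mentioned above: a nested argument is not a plain identifier, so the pattern leaves it untouched.

```python
import re

# Group 1 captures the argument; the surrounding func( ... ) is matched but not captured.
CALL = re.compile(r'func\(([a-zA-Z0-9_]+)\)')


def strip_func_calls(source: str) -> str:
    """Replace every func(<identifier>) with just <identifier>."""
    return CALL.sub(r'\1', source)


content = "std::cout << func(some_val) << std::endl; auto i = func(some_other_val);"
print(strip_func_calls(content))

# The argument g(x) contains parentheses, so the pattern does not match here.
print(strip_func_calls("auto j = func(g(x));"))
```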

Related

Regex (JS Notation): Select spaces not in [ [], {}, "" ] to tokenize string

So I need to tokenize a string by all spaces not between quotes, I am using regex in Javascript notation.
For example:
" Test Test " ab c " Test" "Test " "Test" "T e s t"
becomes
[" Test Test ",ab,c," Test","Test ","Test","T e s t"]
For my use case however, the solution should work in the following test setting:
https://www.regextester.com/
All Spaces not within quotes should be highlighted in the above setting. If they are highlighted in the above setting they would be parsed correctly in my program.
For more specificity, I am using Boost::Regex C++ to do the parsing as follows:
...
std::string test_string("\" Test Test \" ab c \" Test\" \"Test \" \"Test\" \"T e s t\"");
// (,|;)?\\s+ : Split on ,\s or ;\s
// (?![^\\[]*\\]) : Ignore spaces inside []
// (?![^\\{]*\\}) : Ignore spaces inside {}
// (?![^\"].*\") : Ignore spaces inside "" !!! MY ATTEMPT DOESN'T WORK !!!
//Note the below regex delimiter declaration does not include the erroneous regex.
boost::regex delimiter("(,|;\\s|\\s)+(?![^\\[]*\\])(?![^\\(]*\\))(?![^\\{]*\\})");
std::vector<std::string> string_vector;
boost::split_regex(string_vector, test_string, delimiter);
For those of you who do not use Boost::regex or C++ the above link should enable testing of viable regex for the above use case.
Thank you all for your assistance. I hope you can help me with the above problem.
I would 100% not use regular expressions for this. First off, because it's way easier to express as a PEG grammar instead. E.g.:
std::vector<std::string> tokens(std::string_view input) {
    namespace x3 = boost::spirit::x3;
    std::vector<std::string> r;
    auto atom                            //
        = '[' >> *~x3::char_(']') >> ']' //
        | '{' >> *~x3::char_('}') >> '}' //
        | '"' >> *~x3::char_('"') >> '"' //
        | x3::graph;
    auto token = x3::raw[*atom];
    parse(input.begin(), input.end(), token % +x3::space, r);
    return r;
}
This, off the bat, already performs as you intend:
Live On Coliru
int main() {
    for (std::string const input : {R"(" Test Test " ab c " Test" "Test " "Test" "T e s t")"}) {
        std::cout << input << "\n";
        for (auto& tok : tokens(input))
            std::cout << " - " << quoted(tok, '\'') << "\n";
    }
}
Output:
" Test Test " ab c " Test" "Test " "Test" "T e s t"
- '" Test Test "'
- 'ab'
- 'c'
- '" Test"'
- '"Test "'
- '"Test"'
- '"T e s t"'
BONUS
Where this really makes the difference, is when you realize that you wanted to be able to handle nested constructs (e.g. "string" [ {1,2,"3,4", [true,"more [string]"], 9 }, "bye ]).
Regular expressions are notoriously bad at this. Spirit grammar rules can be recursive though. If you make your grammar description more explicit I could show you examples.
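For readers who want to experiment with the flat (non-nested) atom rule without Spirit, here is a rough Python sketch. It is a swapped-in technique, not the grammar above: a single regex alternation that treats a bracketed, braced or quoted run as one unit and otherwise consumes one non-space character at a time. The names ATOMS and split_tokens are made up for illustration, and nesting is deliberately not handled.

```python
import re

# One token is a run of "atoms": a [...] group, a {...} group, a "..." string,
# or any single non-whitespace character. The order matters: the delimited
# forms are tried before the single-character fallback.
ATOMS = re.compile(r'(?:\[[^\]]*\]|\{[^}]*\}|"[^"]*"|\S)+')


def split_tokens(text: str) -> list[str]:
    """Split on whitespace that is not inside [], {} or double quotes."""
    return ATOMS.findall(text)


sample = '" Test Test " ab c " Test" "Test " "Test" "T e s t"'
for token in split_tokens(sample):
    print(repr(token))

print(split_tokens('[a b] {c d} e'))
```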
You can use multiple regexes if you are ok with that. The idea is to replace spaces inside quotes with a non-printable char (\x01), and restore them after the split:
const input = `" Test Test " ab c " Test" "Test " "Test" "T e s t"`;
let result = input
  .replace(/"[^"]*"/g, m => m.replace(/ /g, '\x01')) // replace spaces inside quotes
  .split(/ +/) // split on spaces
  .map(s => s.replace(/\x01/g, ' ')); // restore spaces inside quotes
console.log(result);
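The same mask, split, restore idea can be sketched in Python. This is only an illustration of the technique: the function name split_outside_quotes is hypothetical, and it assumes, as the JavaScript version does, that \x01 never occurs in the input.

```python
import re

PLACEHOLDER = "\x01"  # assumed never to appear in real input


def split_outside_quotes(text: str) -> list[str]:
    # 1. Hide the spaces inside each "..." run behind the placeholder.
    masked = re.sub(r'"[^"]*"', lambda m: m.group(0).replace(" ", PLACEHOLDER), text)
    # 2. Split on the spaces that are still visible.
    parts = re.split(r" +", masked)
    # 3. Put the hidden spaces back.
    return [part.replace(PLACEHOLDER, " ") for part in parts]


sample = '" Test Test " ab c " Test" "Test " "Test" "T e s t"'
print(split_outside_quotes(sample))
```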
If you have escaped quotes within a string, such as "a \"quoted\" token" you can use this regex instead:
const input = String.raw`"A \"quoted\" token" " Test Test " ab c " Test" "Test " "Test" "T e s t"`; // String.raw keeps the backslashes
let result = input
  .replace(/".*?[^\\]"/g, m => m.replace(/ /g, '\x01')) // replace spaces inside quotes
  .split(/ +/) // split on spaces
  .map(s => s.replace(/\x01/g, ' ')); // restore spaces inside quotes
console.log(result);
If you want to parse nested brackets you need a proper language parser. You can also do that with regexes however: Parsing JavaScript objects with functions as JSON
Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex

Regex to match strings not enclosed in macro

In a development context, I would like to make sure all strings in source files within certain directories are enclosed in some macro "STR_MACRO". For this I will be using a Python script parsing the source files, and I would like to design a regex for detecting non-commented lines with strings not enclosed in this macro.
For instance, the regex should match the following strings:
std::cout << "Hello World!" << std::endl;
load_file("Hello World!");
But not the following ones:
std::cout << STR_MACRO("Hello World!") << std::endl;
load_file(STR_MACRO("Hello World!"));
// "foo" bar
Excluding commented lines containing strings seems to work well using the regex ^(?!\s*//).*"([^"]+)". However, when I try to also exclude non-commented strings already enclosed in the macro, using the regex ^(?!\s*//).*(?!STR_MACRO\()"([^"]+)", nothing changes (seemingly because of the opening parenthesis after STR_MACRO).
Any hints on how to achieve this?
With the PyPI regex module (which you can install with pip install regex in the terminal), you can use
import regex
pattern = r'''(?:^//.*|STR_MACRO\("[^"\\]*(?:\\.[^"\\]*)*"\))(*SKIP)(*F)|"[^"\\]*(?:\\.[^"\\]*)*"'''
text = r'''For instance, the regex should match the following strings:
std::cout << "Hello World!" << std::endl;
load_file("Hello World!");
But not the following ones:
std::cout << STR_MACRO("Hello World!") << std::endl;
load_file(STR_MACRO("Hello World!"));
// "foo" bar'''
print( regex.sub(pattern, r'STR_MACRO(\g<0>)', text, flags=regex.M) )
Details:
(?:^//.*|STR_MACRO\("[^"\\]*(?:\\.[^"\\]*)*"\))(*SKIP)(*F) - // at the line start and the rest of the line, or STR_MACRO( + a double quoted string literal pattern + ), and then the match is skipped, and the next match search starts at the failure location
| - or
"[^"\\]*(?:\\.[^"\\]*)*" - a " char, zero or more chars other than " and \, then zero or more repetitions of a \ followed by any single char and then zero or more chars other than " and \, and then a " char
See the Python demo. Output:
For instance, the regex should match the following strings:
std::cout << STR_MACRO("Hello World!") << std::endl;
load_file(STR_MACRO("Hello World!"));
But not the following ones:
std::cout << STR_MACRO("Hello World!") << std::endl;
load_file(STR_MACRO("Hello World!"));
// "foo" bar
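If installing the third-party regex module is not an option, a similar result can be sketched with the standard re module. Note that this swaps in a different technique: instead of (*SKIP)(*F), the unwanted contexts are matched first in an alternation and only the bare string is captured in group 1, with a replacement function deciding what to do. The names STRING, PATTERN and wrap_bare_strings are illustrative, and allowing leading whitespace before // is an assumption taken from the question's own pattern.

```python
import re

# A double-quoted literal that may contain backslash escapes.
STRING = r'"[^"\\]*(?:\\.[^"\\]*)*"'

# Alternatives 1 and 2 consume what must be left alone (comment lines and
# already wrapped strings); only alternative 3 fills group 1.
PATTERN = re.compile(
    rf'^[ \t]*//.*|STR_MACRO\({STRING}\)|({STRING})',
    re.MULTILINE,
)


def wrap_bare_strings(source: str) -> str:
    def repl(match: re.Match) -> str:
        if match.group(1) is None:
            return match.group(0)  # comment or wrapped string: keep as is
        return f"STR_MACRO({match.group(1)})"

    return PATTERN.sub(repl, source)


code = (
    'std::cout << "Hello World!" << std::endl;\n'
    'load_file(STR_MACRO("Hello World!"));\n'
    '// "foo" bar'
)
print(wrap_bare_strings(code))
```

To only detect offending lines rather than rewrite them, the same pattern can be used with finditer, reporting the matches where group 1 is set.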

QRegularExpression find and capture all quoted and non-quoted parts in string

I am fairly new to using regexes.
I got a string which can contain quoted and not quoted substrings.
Here are examples of how they could look:
"path/to/program.exe" -a -b -c
"path/to/program.exe" -a -b -c
path/to/program.exe "-a" "-b" "-c"
path/to/program.exe "-a" -b -c
My regex looks like this: (("[^"]*")|([^"\t ]+))+
With ("[^"]+") I attempt to find every quoted substring and capture it.
With ([^"\t ]+) I attempt to find every substring without quotes.
My code to test this behaviour looks like this:
QString toMatch = R"del( "path/to/program.exe" -a -b -c)del";
qDebug() << "String to Match against: " << toMatch << "\n";
QRegularExpression re(R"del((("[^"]+")|([^"\t ]+))+)del");
QRegularExpressionMatchIterator it = re.globalMatch(toMatch);
int i = 0;
while (it.hasNext())
{
    QRegularExpressionMatch match = it.next();
    qDebug() << "iteration: " << i << " captured: " << match.captured(i) << "\n";
    i++;
}
Output:
String to Match against: " \"path/to/program.exe\" -a -b -c"
iteration: 0 captured: "\"path/to/program.exe\""
iteration: 1 captured: "-a"
iteration: 2 captured: ""
iteration: 3 captured: "-c"
Testing it in Regex101 shows me the result I want.
I also tested it on some other websites e.g this.
I guess I am doing something wrong. Could anyone point me in the right direction?
Thanks in advance.
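You assume that the IDs of the groups you need to read will change with each new match, while in fact all group IDs are fixed by the pattern itself. Calling match.captured(i) with an increasing i therefore reads group i of the i-th match, and that group is empty whenever it did not take part in the match.
I suggest removing all groups and just extracting the whole match value: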
QString toMatch = R"del( "path/to/program.exe" -a -b -c)del";
qDebug() << "String to Match against: " << toMatch << "\n";
QRegularExpression re(R"del("[^"]+"|[^"\s]+)del");
QRegularExpressionMatchIterator it = re.globalMatch(toMatch);
while (it.hasNext())
{
    QRegularExpressionMatch match = it.next();
    qDebug() << " matched: " << match.captured(0) << "\n";
}
Note the "[^"]+"|[^"\s]+ pattern matches either
"[^"]+" - ", then one or more chars other than " and then a "
| - or
[^"\s]+ - one or more chars other than " and whitespace.
See the updated pattern demo.
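The point about fixed group numbers is not specific to Qt. As a rough illustration in Python (a sketch of the same idea, not the Qt API), the loop below prints the whole match next to the tuple of groups for each match: group 0 always holds the token, while groups 1 and 2 are filled or empty depending on which alternative matched.

```python
import re

# Group 1 is the quoted alternative, group 2 the unquoted one.
# These numbers come from the pattern and never change between matches.
TOKEN = re.compile(r'("[^"]+")|([^"\s]+)')

command = ' "path/to/program.exe" -a -b -c'

for m in TOKEN.finditer(command):
    print(repr(m.group(0)), m.groups())

# Reading group 0 of every match is all that is needed to collect the tokens.
tokens = [m.group(0) for m in TOKEN.finditer(command)]
print(tokens)
```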

C++ Regex: non-greedy match

I'm currently trying to make a regex which matches URL parameters and extracts them.
For example, if I got the following parameters string ?param1=someValue&param2=someOtherValue, std::regex_match should extract the following contents:
param1
someValue
param2
someOtherValue
After trying different regex patterns, I finally built one corresponding to what I want: std::regex("(?:[\\?&]([^=&]+)=([^=&]+))*").
If I take the previous example, std::regex_match matches as expected. However, it does not extract the expected values, keeping only the last captured values.
For example, the following code:
std::regex paramsRegex("(?:[\\?&]([^=&]+)=([^=&]+))*");
std::string arg = "?param1=someValue&param2=someOtherValue";
std::smatch sm;
std::regex_match(arg, sm, paramsRegex);
for (const auto &match : sm)
    std::cout << match << std::endl;
will give the following output:
param2
someOtherValue
As you can see, param1 and its value are skipped and not captured.
After searching on Google, I've found that this is due to greedy capture, and I have modified my regex into "(?:[\\?&]([^=&]+)=([^=&]+))\\*?" in order to enable non-greedy capturing.
This regex works well when I try it on rubular but it does not match when I use it in C++ (std::regex_match returns false and nothing is captured).
I've tried different std::regex_constants options (different regex grammar by using std::regex_constants::grep, std::regex_constants::egrep, ...) but the result is the same.
Does someone know how to do non-greedy regex capture in C++?
As Casimir et Hippolyte explained in his comment, I just need to:
remove the quantifier
use std::regex_iterator
It gives me the following code:
std::regex paramsRegex("[\\?&]([^=]+)=([^&]+)");
std::string url_params = "?key1=val1&key2=val2&key3=val3&key4=val4";
std::smatch sm;
auto params_it = std::sregex_iterator(url_params.cbegin(), url_params.cend(), paramsRegex);
auto params_end = std::sregex_iterator();
while (params_it != params_end) {
    auto param = params_it->str();
    std::regex_match(param, sm, paramsRegex);
    for (const auto &s : sm)
        std::cout << s << std::endl;
    ++params_it;
}
And here is the output:
?key1=val1
key1
val1
&key2=val2
key2
val2
&key3=val3
key3
val3
&key4=val4
key4
val4
The original regex (?:[\\?&]([^=&]+)=([^=&]+))* was simply changed into [\\?&]([^=]+)=([^&]+).
Then, by using std::sregex_iterator, I get an iterator over every match (?key1=val1, &key2=val2, ...).
Finally, by calling std::regex_match on each sub-string, I can retrieve the parameter names and values.
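The same "drop the quantifier and iterate over the matches" idea can be sketched in Python with re.finditer. Each match object already carries its own groups, so no second matching pass is needed here. The function name parse_params is only illustrative.

```python
import re

# Same pattern as above, without the outer repetition: one match per parameter.
PARAM = re.compile(r"[?&]([^=]+)=([^&]+)")


def parse_params(query: str) -> list[tuple[str, str]]:
    """Return (name, value) pairs in the order they appear."""
    return [(m.group(1), m.group(2)) for m in PARAM.finditer(query)]


pairs = parse_params("?key1=val1&key2=val2&key3=val3&key4=val4")
for name, value in pairs:
    print(name, value)
```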
Try to use match_results::prefix/suffix:
string match_expression("your expression");
smatch result;
regex fnd(match_expression, regex_constants::icase);
while (regex_search(in_str, result, fnd, std::regex_constants::match_any))
{
    for (size_t i = 1; i < result.size(); i++)
    {
        std::cout << result[i].str();
    }
    in_str = result.suffix();
}

c++ regex substring wrong pattern found

I'm trying to understand the logic of regex in C++
std::string s ("Ni Ni Ni NI");
std::regex e ("(Ni)");
std::smatch sm;
std::regex_search (s,sm,e);
std::cout << "string object with " << sm.size() << " matches\n";
Shouldn't this give me the number of substrings matching my pattern? It always gives me 1 match and says that the match is [Ni , Ni], but I need it to find every occurrence of the pattern; there should be 3, like this: [Ni][Ni][Ni]
The function std::regex_search only returns the results for the first match found in your string.
Here is some code, merged from yours and from cplusplus.com. The idea is to search for the first match, analyze it, and then start again using the rest of the string (that is to say, the sub-string that directly follows the match that was found, which can be retrieved thanks to match_results::suffix).
Note that the regex has two capturing groups (Ni*) and ([^ ]*).
std::string s("the knights who say Niaaa and Niooo");
std::smatch m;
std::regex e("(Ni*)([^ ]*)");
while (std::regex_search(s, m, e))
{
    for (auto x : m)
        std::cout << x.str() << " ";
    std::cout << std::endl;
    s = m.suffix().str();
}
This gives the following output:
Niaaa Ni aaa
Niooo Ni ooo
As you can see, for every call to regex_search, we have the following information:
the content of the whole match,
the content of every capturing group.
Since we have two capturing groups, this gives us 3 strings for every regex_search.
EDIT: in your case if you want to retrieve every "Ni", all you need to do is to replace
std::regex e("(Ni*)([^ ]*)");
with
std::regex e("(Ni)");
You still need to iterate over your string, though.
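For comparison, the "search, then continue after the match" loop looks like this in Python. It is a sketch of the same idea rather than the C++ API: a single search reports only the first occurrence, and the loop resumes each search at the end of the previous match instead of copying the suffix.

```python
import re

text = "Ni Ni Ni NI"
pattern = re.compile(r"(Ni)")

# One search call finds only the first occurrence.
first = pattern.search(text)
print(first.start(), first.group(1))

# Resume each search where the previous match ended. This is safe here
# because the pattern can never produce an empty match.
matches = []
pos = 0
while (m := pattern.search(text, pos)) is not None:
    matches.append((m.start(), m.group(1)))
    pos = m.end()

print(matches)

# findall performs the same iteration internally.
print(len(pattern.findall(text)))
```

The final "NI" is not counted because the pattern is case-sensitive.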