I'm looking for a regex to match all calls, potentially spanning multiple lines, to a variadic C function. The end goal is to print the file, line number, and the fourth parameter of each call, but unfortunately I'm not there yet. So far, I have this:
perl -ne 'print if s/^.*?(func1\s*\(([^\)\(,]+||,|\((?2)\))*\)).*?$/$1/s' test.c
with test.c:
int main() {
func1( a, b, c, d);
func1( a, b,
c, d);
func1( func2(), b, c, d, e );
func1( func2(a), b, c, d, e );
return 1;
}
-- which does not match the second call. The reason it doesn't match is that the s at the end of the expression allows . to match newlines, but doesn't seem to allow [..] constructs to match newlines. I'm not sure how to get past this.
I'm also not sure how to reference the fourth parameter in this... the $2, $3 do not get populated in this (and even if they did I imagine I would get some issues due to the recursive nature of the regex).
This should catch your functions, with caveats
perl -0777 -wnE'@f = /(func1\s*\( [^;]* \))\s*;/xg; s/\s+/ /g, say for @f' tt.c
I rely on the fact that a statement must be terminated by ;. This assumes that there is no stray ; inside a comment, and it does not handle calls to func1 nested inside another call. If either of those is possible, then quite a bit more needs to be done to parse the code.
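The same whole-file idea can be sketched in Python, which also makes it easy to recover the line number the question asks for. This is only an illustration under the same caveats (no stray ; in comments or strings, no nested func1 calls); the names find_calls and SOURCE and the added \b word boundary are assumptions of this sketch, not part of the one-liner above.

```python
import re

# Sample input copied from the question's test.c.
SOURCE = """int main() {
func1( a, b, c, d);
func1( a, b,
c, d);
func1( func2(), b, c, d, e );
func1( func2(a), b, c, d, e );
return 1;
}
"""

def find_calls(text, name="func1"):
    """Return (line_number, call) pairs for statements that call `name`.

    Like the one-liner, this relies on every statement ending in ';'.
    """
    pattern = re.compile(r"(\b" + re.escape(name) + r"\s*\([^;]*\))\s*;")
    calls = []
    for match in pattern.finditer(text):
        # Line number of the first character of the call (1-based).
        line = text.count("\n", 0, match.start()) + 1
        # Collapse newlines and runs of blanks into single spaces.
        calls.append((line, re.sub(r"\s+", " ", match.group(1))))
    return calls

for line, call in find_calls(SOURCE):
    print(line, call)
```

Because [^;] matches newlines without any flag, the call that spans two lines is found as long as the whole file is searched at once rather than line by line.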
However, further parsing the captured calls, presumably by commas, is complicated by the fact that a nested call may well, and realistically, contain commas. How about
func1( a, b, f2(a2, b2), c, f3(a3, b3), d );
This becomes a far more interesting little parsing problem. Or, how about macros?
Can you clarify what kinds of things one doesn't have to account for?
Since it may be possible to ignore the caveats mentioned above, here is a way to parse the argument list, using Text::Balanced.
Since we need to extract whole function calls if they appear as an argument, like f(a, b), the most suitable function from the library is extract_tagged. With it we can make the opening tag be a word-left-parenthesis (\w+\() and the closing one a right-parenthesis \).
This function extracts only the first occurrence, so it is wrapped in extract_multiple:
use warnings;
use strict;
use feature 'say';
use Text::Balanced qw(extract_multiple extract_tagged);
use Path::Tiny; # path(). for slurp
my $file = shift // die "Usage: $0 file-to-parse\n";
my @functions = path($file)->slurp =~ /( func1\( [^;]* \) );/xg;
s/\s+/ /g for @functions;
for my $func (@functions) {
    my ($args) = $func =~ /func1\s*\(\s* (.*) \s*\)/x;
    say $args;
    my @parts = extract_multiple( $args, [ sub {
        extract_tagged($args, '\\w+\\(', '\\\)', '.*?(?=\w+\()')
    } ] );
    my @arguments = grep { /\S/ } map { /\(/ ? $_ : split /\s*,\s*/ } @parts;
    s/^\s*|\s*\z//g for @arguments;
    say "\t$_" for @arguments;
}
The extract_multiple returns parts with the (nested) function calls alone (identifiable by having parens), which are arguments as they stand and what we sought with all this, and parts which are strings with comma-separated groups of other arguments, that are split into individual arguments.
Note the amount of escaping in extract_tagged (found by trial and error)! This is needed because those strings are double-quoted twice in a string eval. That isn't documented at all, so see the module source.
Or directly produce escape-hungry characters (\x5C for \), which then need no escaping
extract_tagged($_[0], "\x5C".'w+'."\x5C(", '\x5C)', '.*?(?=\w+\()')
I don't know which I'd call "clearer"
I tested on the file provided in the question, to which I added a function
func1( a, b, f2(a2, f3(a3, b3), b2), c, f4(a4, b4), d, e );
For each function the program prints the string with the argument list to parse and the parsed arguments, and the most interesting part of the output is for the above (added) function
[ ... ]
a, b, f2(a2, f3(a3, b3), b2), c, f4(a4, b4), d, e
a
b
f2(a2, f3(a3, b3), b2)
c
f4(a4, b4)
d
e
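For comparison, the splitting step can also be done without a balanced-text library by tracking parenthesis depth and splitting only on commas at depth zero. The following Python sketch is an assumption-laden illustration, not a port of the Perl program: split_args is a hypothetical helper, and it ignores commas or parentheses inside string literals and comments.

```python
def split_args(arg_string):
    """Split a C argument list on top-level commas only."""
    args = []
    current = []
    depth = 0
    for ch in arg_string:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            # A comma outside any nested call ends the current argument.
            args.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        args.append(tail)
    return args

arguments = split_args("a, b, f2(a2, f3(a3, b3), b2), c, f4(a4, b4), d, e")
print(arguments)
print("fourth parameter:", arguments[3])
```

With the argument list from the added test call, the fourth parameter is c, since the nested f2(...) call counts as a single argument.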
Not Perl but perhaps simpler:
$ cat >test2.c <<'EOD'
int main() {
func1( a, b, c, d1);
func1( a, b,
c, d2);
func1( func2(), "quotes\"),(", /*comments),(*/ g(b,
c), "d3", e );
func1( func2(a), b, c, d4(p,q,r), e );
func1( a, b, c, func2( func1(a,b,c,d5,e,f) ), g, h);
return 1;
}
EOD
$ cpp -D'func1(a,b,c,d,...)=SHOW(__FILE__,__LINE__,d)' test2.c |
grep SHOW
SHOW("test2.c",2,d1);
SHOW("test2.c",3,d2)
SHOW("test2.c",5,"d3")
SHOW("test2.c",7,d4(p,q,r));
SHOW("test2.c",8,func2( SHOW("test2.c",8,d5) ));
$
As the final line shows, a bit more work is needed if the function can take itself as an argument.
I would like to force rematch in the following scenario - I'm trying to inverse match a qualifier after each element in a list. In other words I have:
"int a, b, c" =~ m{
(?(DEFINE)
(?<qualifs>\s*(?<qualif>\bint\b|\bfloat\b)\s*+(?{print $+{qualif} . "\n"}))
(?<decl>\s*(?!(?&qualif))(?<ident>[_a-zA-Z][_a-zA-Z0-9]*+)\s*(?{print $+{ident} . "\n"}))
(?<qualifsfacet>\s*\bint\b\s*+)
(?<declfacet>[_a-zA-Z][_a-zA-Z0-9]*+)
)
^((?&qualifsfacet)*+(?!(?&decl))
|(?&qualifs)*+(?&declfacet)
|((?&qualifsfacet)
(?&declfacet)(?<negdecl>\g{lastnegdecl}(,(?&decl)))
|(?&qualifs)*+(?&declfacet)(?<lastnegdecl>\g{negdecl})
(?# Here how to force it to retry last with new lastnegdecl)))$
}xxs;
And would like to have:
a
int
b
int
c
int
As output. Currently it's only this:
a
int
int
I think this might work if there is a way to tell the regex machine to retrigger a match for the new lastnegdecl that is being captured.
Well after some trying I finally figured it out (besides the obvious whitespace issues I had in my original post):
"int a, b, c" =~ m{
(?(DEFINE)
(?<qualifs>\s*+(?<qualif>\bint\b|\bfloat\b)\s*+(?{print $+{qualif} . "\n"}))
(?<decl>\s*+(?!(?&qualif))(?<ident>[_a-zA-Z][_a-zA-Z0-9]*+)\s*(?{print $+{ident} . "\n"}))
(?<qualifsfacet>\s*+(\bint\b|\bfloat\b)\s*+)
(?<declfacet>\s*+[_a-zA-Z][_a-zA-Z0-9]*+\s*+)
)
^((?&qualifsfacet)(?!(?&decl))
|(?&qualifs)*+(?&declfacet)
|(?<restoutter>(?=(?&qualifsfacet)(?&declfacet)
(?<rest>(?(<rest>)\g{rest}),(?&decl)))
((?&qualifs)(?&declfacet)\g{rest}|(?&restoutter)))
|(?&qualifsfacet)(?&declfacet)(,(?&declfacet))*+)$
}xxs;
Basically, I do a positive lookahead in which decl is invoked with its code block but qualifs is not, while each decl is accumulated inside rest. Then I attempt a partial match with qualifs followed by rest, and if that fails the pattern recurses and does the same thing again. Maybe someone can explain it better, but it works. The output of the program above is:
a
int
b
int
c
int
And there is a full match.
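If the goal is only the output and not the regex-engine technique, the same pairing can be produced with ordinary parsing. The Python sketch below deliberately swaps in a different approach (match the qualifier once, split the identifiers, then pair them up) and does not reproduce the recursive pattern; pair_identifiers is a hypothetical name, and only a single int or float qualifier is assumed.

```python
import re

# One qualifier followed by the rest of the declaration.
DECL = re.compile(r"\s*(int|float)\b\s*(.*)")
IDENT = re.compile(r"[_a-zA-Z][_a-zA-Z0-9]*")

def pair_identifiers(declaration):
    """Return [ident, qualifier, ident, qualifier, ...] or None if invalid."""
    match = DECL.fullmatch(declaration)
    if match is None:
        return None
    qualifier, rest = match.groups()
    names = [name.strip() for name in rest.split(",")]
    if not all(IDENT.fullmatch(name) for name in names):
        return None
    pairs = []
    for name in names:
        pairs.extend([name, qualifier])
    return pairs

for item in pair_identifiers("int a, b, c"):
    print(item)
```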
Is there a program (like Astyle) which can UN-split lines?
Example:
// These
void foo(
int one,
int two,
double three);
double a = b *
c;
// Becomes this
void foo(int one, int two, double three);
double a = b * c;
Using any decent IDE with regex search replace capabilities you can match this: ,\s*\r\n\s*(.*)
And replace with this: , $1
This assumes \r\n is your newline match and $1 is how you print your first capture.
EDIT:
This is an easy change to your match code if you want to add in some other symbols: [,+*-]\s*\r\n\s*(.*)
One thing you'll have to look out for is commented lines like this:
// blah,
// blah*
// blah
Because the regex will pick them up as well.
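As a rough illustration of the same search-and-replace outside an IDE, here is a Python sketch. It assumes the same set of trailing symbols, accepts both \r\n and \n line endings, and, like the regex above, does not skip commented lines; unsplit and JOIN_AFTER are hypothetical names.

```python
import re

# A line ending in one of these symbols is joined with the next line.
JOIN_AFTER = re.compile(r"([,+*-])[ \t]*\r?\n[ \t]*")

def unsplit(text):
    """Join continuation lines that follow a trailing ',', '+', '*' or '-'."""
    return JOIN_AFTER.sub(r"\1 ", text)

sample = "double a = b *\n    c;\n"
print(unsplit(sample), end="")
```

A line that ends in an opening parenthesis, as in the void foo( example, is not joined by this character class; it would have to be added to the class, with the replacement adjusted so that no space follows the parenthesis.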
I have modified a function, let's say
foo(int a, char b); into foo(int a, char b, float new_var);
The old function foo is called from many places in the existing source code. I want to replace every occurrence of foo(some_a, some_b); with foo(some_a, some_b, 0); using a script (manual editing is laborious).
I tried to do the same with grep 's/old/new/g' file, but the problem is using a regex to match the parameters and insert them into the replacement text.
PS: a & b are either valid C variable names OR valid constant values for their types.
Since we are working with C code, we don't need to worry about function overloading issue.
Under the assumption of simple variable names and constant being passed into function call, and that the code was compilable before the change in function prototype:
Search with foo(\([^,]*\),\([^,]*\)) and replace with foo(\1,\2,0).
There are a few caveats, though:
The char field should not contain ','. It will not be replaced if it does.
The char field should not contain ). It will cause bad replacement.
There is no function whose name ends with foo. The replacement may make a mess out of your code if there is.
Search with \(foo([^)]+\) and replace with \1,0 (Thanks to glenn jackman)
There are a few caveats, though:
Both int and char field should not contain ). It will cause bad replacement.
There is no function whose name ends with foo. Similarly, the replacement may make a mess out of your code.
This should make it:
sed -i.bak -r "s/foo\(\s*([0-9]*)\s*,\s*(([0-Z]|'[0-Z]'))\s*\)/foo(\1, \2, 0)/g" file
Note that -i.bak means "edit the file in place" and keep a backup of the original file in file.bak.
foo\(\s*([0-9]*)\s*,\s*(([0-Z]|'[0-Z]'))\s*\) looks for something like foo( + spaces + digits + spaces + , + spaces + [ 'character' or character ] + spaces + ). It replaces the match with foo( + digits + , + character + , 0 + ).
Example
$ cat a
hello foo(23, h);
this is another foo(23, bye);
this is another foo(23, 'b');
and we are calling to_foo(23, 'b'); here
foo(23, 'b'); here
$ sed -i.bak -r "s/foo\(\s*([0-9]*)\s*,\s*(([0-Z]|'[0-Z]'))\s*\)/foo(\1, \2, 0)/g" a
$ cat a
hello foo(23, h, 0);
this is another foo(23, bye);
this is another foo(23, 'b', 0);
and we are calling to_foo(23, 'b', 0); here
foo(23, 'b', 0); here
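A comparable substitution can be sketched in Python. The version below is an assumption-based variant, not a translation of the sed command: it accepts any argument that contains no comma or parenthesis, and it adds a \b word boundary so that names such as to_foo are left alone (unlike the sed output above). The helper name add_default_arg is hypothetical.

```python
import re

def add_default_arg(source, name="foo", default="0"):
    """Append `default` as a third argument to two-argument calls of `name`."""
    pattern = re.compile(
        r"\b" + re.escape(name) + r"\(\s*([^(),]+?)\s*,\s*([^(),]+?)\s*\)"
    )
    return pattern.sub(
        lambda m: f"{name}({m.group(1)}, {m.group(2)}, {default})", source
    )

for line in [
    "hello foo(23, h);",
    "this is another foo(23, bye);",
    "and we are calling to_foo(23, 'b'); here",
]:
    print(add_default_arg(line))
```

Calls that already have three arguments do not match, so running the substitution twice is harmless. Arguments that themselves contain a comma or a parenthesis, such as ',' or a nested call, are skipped rather than rewritten.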
The regex is: [%{1,2}|\/\/]
This matches %, %% and //
FYI: this is a warning generated from flycheck.
The warning means that you have duplicates in your character class. When you put something between square brackets, it means "one of those": [abc] means any of a, b or c. [a|b] means any of a, | or b.
So when you do [\/\/], you mean "either /, or /" which is obviously a duplicate. For the same reason, [%{1,2}] means "either % or { or 1 or , or 2 or }" which is clearly not what you want.
Grouping is done with parentheses, not square brackets, so use this regex instead:
(%{1,2}|\/\/)
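The difference is easy to check by experiment. The following Python sketch (the variable names are arbitrary) compares the two patterns using full-string matches; Python's re module treats character classes and groups the same way for this purpose.

```python
import re

char_class = re.compile(r"[%{1,2}|\/\/]")  # one character out of a set
group = re.compile(r"(%{1,2}|//)")         # one or two '%', or the string '//'

for text in ["%", "%%", "//", "{", "|", "1"]:
    print(
        repr(text),
        "class:", bool(char_class.fullmatch(text)),
        "group:", bool(group.fullmatch(text)),
    )
```

The character class accepts stray single characters such as { and | but can never match the two-character strings %% or // as a whole, while the group matches exactly %, %% and //.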
Basically I'm trying to replace with whatever is returned from a function call from an object. But I need a return value from the regex search as the argument. It's a little tricky but the code should speak for itself:
while ( $token =~ s/\$P\(([a-z0-9A-Z_]+)\)/$db->getValue("params", qw($1))/e ) { }
The error I'm getting is that $1 is not getting evaluated to anything (the argument literally becomes "$1") so it screws up my getValue() method.
Cheers
The qw() operator quotes "words", i.e. it splits a string at whitespace and returns the resulting list. It does not interpolate, so qw($1) yields the literal string '$1'.
You can just use the variable "as is":
s/\$P\(([a-z0-9A-Z_]+)\)/$db->getValue("params", $1)/e
The qw() function is very different from
q(abc) (<=> 'abc'),
qq(abc) (<=> "abc"), and
qx(abc) (<=> `abc`) or
qr(abc) (<=> m/abc/):
qw(a b c) <=> ('a', 'b', 'c')
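For readers more familiar with Python, the corrected substitution corresponds to passing a function as the replacement, which receives the match and returns the text to insert. This is only an analogy: FakeDb and its get_value method are hypothetical stand-ins for the $db object, whose real interface is not shown in the question.

```python
import re

PARAM = re.compile(r"\$P\(([A-Za-z0-9_]+)\)")

class FakeDb:
    """Hypothetical stand-in for the question's $db object."""

    def __init__(self, params):
        self._params = params

    def get_value(self, table, key):
        # The table name is ignored in this toy lookup.
        return self._params[key]

def expand(token, db):
    """Replace each $P(name) with the value looked up for `name`."""
    return PARAM.sub(lambda m: db.get_value("params", m.group(1)), token)

db = FakeDb({"host": "example", "port": "8080"})
print(expand("host=$P(host) port=$P(port)", db))
```

Here m.group(1) plays the role of $1: it is passed to the lookup as a value, not as a quoted word.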