Regular expression to match a C function call over multiple lines

I am struggling to put together a regex to match a function call like the following:
funcname (...(..
...)..(..(...
)...)..)
so the function call can have multiple bracketed parameters spread over multiple lines.
The dots can be anything apart from '(' or ')'.
I would use the regex with sed or grep.
Thanks,
Risto

So, I went on writing this simple parser in bash. It is not perfect but can serve as a starting point. For example it cannot distinguish if a function call is commented out or not, etc.
while read file; do
linenum=0
fmatch=0 # not inside a call yet; the numeric test below fails if this is unset
while IFS= read -r line; do
(( linenum++ ))
if [ $fmatch -eq 0 ]; then
if [[ ! $line =~ $funcname ]]; then
continue
fi
linenummatch=$linenum
fmatch=1
fstripped=0
openbracket=0
closebracket=0
spacenum=0
fi
linelen=${#line}
position=0
while [ $position -lt $linelen ]; do
if [ $fstripped -eq 0 ]; then
subline=${line:$position}
mlen=`expr "$subline" : "$funcname"`
if [ $mlen -gt 0 ]; then
(( position+=mlen ))
resultstr=$funcname
fstripped=1
continue
fi
(( position++ ))
continue
fi
ch=${line:$position:1}
case $ch in
'(' )
(( openbracket++ ))
spacenum=0
newresultstr="$resultstr$ch"
;;
')' )
if [ $openbracket -eq 0 ]; then
fmatch=0
break
fi
(( closebracket++ ))
spacenum=0
newresultstr="$resultstr$ch"
if [ $closebracket -eq $openbracket ]; then
echo "$file $linenummatch $newresultstr"
fmatch=0
break
fi
;;
' ' | $'\t' )
if [ $spacenum -eq 0 ]; then
newresultstr=$resultstr' '
fi
(( spacenum++ ))
;;
$'\n' )
# line feeds are skipped
;;
* )
if [ $openbracket -eq 0 ]; then
fmatch=0
break
fi
spacenum=0
newresultstr="$resultstr$ch"
;;
esac
resultstr=$newresultstr
(( position++ ))
done
done < "$file"
done < "$filelist"

As C is not a regular language, you may need a real parser for this. The hard part is working out when every open bracket has been closed again. You can do some fairly strange things in C: for example, a parameter can itself be a function. Consider the following program: how would you distinguish between a(), b(), c(), d(), e(), f() and g()?
#include <stdio.h>
#define f(c) c;
char a()
{
return f('z');
}
/*
A function in a comment.
char b()
{
return 'y';
}
*/
char c(char d())
{
return d();
}
#if 0
This code is not included
char g()
{
return 'v';
}
#endif
void main()
{
printf ("A function in a string: char e() { return 'x'; }\n");
printf ("The result from passing a to c: %c\n", c(a));
printf ("Press enter to exit");
getchar();
}
I have seen many attempts to do this kind of thing with Regular Expressions but most of them end up with Catastrophic Backtracking issues.
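One way to avoid backtracking altogether is to count brackets instead of matching them with a pattern. The following is only a minimal bash sketch of that idea, not a C parser: find_call is a hypothetical helper name, the whole text is assumed to fit in one variable, only the first occurrence of the name is examined (as a plain substring), and comments, string literals and macros like the ones in the program above are not handled.

```shell
#!/usr/bin/env bash
# find_call NAME TEXT
# Prints the first "NAME (...)" call found in TEXT, joined onto one line,
# or returns 1 if NAME is not followed by a balanced bracket group.
# Assumption: comments, string literals and macros are NOT recognised.
find_call() {
    local name=$1 text=$2
    local rest out depth=0 i ch
    [[ $text == *"$name"* ]] || return 1
    rest=${text#*"$name"}       # text after the first occurrence of NAME
    out=$name
    for (( i = 0; i < ${#rest}; i++ )); do
        ch=${rest:i:1}
        case $ch in
            '(')
                depth=$(( depth + 1 )) ;;
            ')')
                [[ $depth -gt 0 ]] || return 1      # stray ")"
                depth=$(( depth - 1 ))
                if [[ $depth -eq 0 ]]; then
                    printf '%s\n' "$out$ch"         # outermost bracket closed
                    return 0
                fi ;;
            $'\n')
                continue ;;                         # join lines
            *)
                # Only whitespace may sit between NAME and its "(".
                if [[ $depth -eq 0 && $ch != [[:space:]] ]]; then
                    return 1
                fi ;;
        esac
        out+=$ch
    done
    return 1    # ran out of text before the call was closed
}
```

It could be called as find_call funcname "$(< file.c)". Because it only keeps a depth counter, its running time is linear in the length of the text.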

Related

Perl regex vs. Raku regex, differences in the engine?

I am trying to convert a regex-based solution for the knapsack problem from Perl to Raku. The details are on PerlMonks.
The Perl solution creates this regex:
(?<P>(?:vvvvvvvvvv)?)
(?<B>(?:vv)?)
(?<Y>(?:vvvv)?)
(?<G>(?:vv)?)
(?<R>(?:v)?)
0
(?=
(?(?{ $1 })wwww|)
(?(?{ $2 })w|)
(?(?{ $3 })wwwwwwwwwwww|)
(?(?{ $4 })ww|)
(?(?{ $5 })w|)
)
which gets matched against vvvvvvvvvvvvvvvvvvv0wwwwwwwwwwwwwww. After that the match hash %+ contains the items to put in the sack.
My Raku conversion is:
$<B> = [ [ vv ]? ]
$<P> = [ [ vvvvvvvvvv ]? ]
$<R> = [ [ v ]? ]
$<Y> = [ [ vvvv ]? ]
$<G> = [ [ vv ]? ]
0
<?before
[ { say "B"; say $/<B>; say $0; say $1; $1 } w || { "" } ]
[ { say "P"; say $/<P>; say $0; say $1; $2 } wwww || { "" } ]
[ { say "R"; say $/<R>; say $0; say $1; $3 } w || { "" } ]
[ { say "Y"; say $/<Y>; say $0; say $1; $4 } wwwwwwwwwwww || { "" } ]
[ { say "G"; say $/<G>; say $0; say $1; $5 } ww || { "" } ]
>
which also matches vvvvvvvvvvvvvvvvvvv0wwwwwwwwwwwwwww. But the match object, $/, does not contain anything useful. Also, all of my debug say statements print Nil, so the backreferences do not seem to work at that point?
Here's my test script:
my $max-weight = 15;
my %items =
'R' => { w => 1, v => 1 },
'B' => { w => 1, v => 2 },
'G' => { w => 2, v => 2 },
'Y' => { w => 12, v => 4 },
'P' => { w => 4, v => 10 }
;
my $str = 'v' x %items.map(*.value<v>).sum ~
'0' ~
'w' x $max-weight;
say $str;
my $i = 0;
my $left = my $right = '';
for %items.keys -> $item-name
{
my $v = 'v' x %items{ $item-name }<v>;
my $w = 'w' x %items{ $item-name }<w>;
$left ~= sprintf( '$<%s> = [ [ %s ]? ] ' ~"\n", $item-name, $v );
$right ~= sprintf( '[ { say "%s"; say $/<%s>; say $0; say $1; $%d } %s || { "" } ]' ~ "\n", $item-name, $item-name, ++$i, $w );
}
use MONKEY-SEE-NO-EVAL;
my $re = sprintf( '%s0' ~ "\n" ~ '<?before ' ~ "\n" ~ '%s>' ~ "\n", $left, $right );
say $re;
dd $/ if $str ~~ m:g/<$re>/;
This answer only covers what's going wrong. It does not address a solution. I have not filed corresponding bugs. I have not yet even searched bug queues to see if I can find reports corresponding to either or both of the two issues I've surfaced.
my $lex-var;
sub debug { .say for ++$, :$<rex-var>, :$lex-var }
my $regex = / $<rex-var> = (.) { $lex-var = $<rex-var> } <?before . { debug }> / ;
'xx' ~~ $regex; say $/;
'xx' ~~ / $regex /; say $/;
displays:
1
rex-var => Nil
lex-var => 「x」
「x」
rex-var => 「x」
2
rex-var => Nil
lex-var => 「x」
「x」
Focusing first on the first call of debug (the lines starting with 1 and ending at rex-var => 「x」), we can see that:
Something's gone awry during the call to debug: $<rex-var> is reported as having the value Nil.
When the regex match is complete and we return to the mainline, the say $/ reports a full and correctly populated result that includes the rex-var named match.
To begin to get a sense of what's gone wrong, please consider reading the bulk of my answer to another SO question. You can safely skip the "Using ~" section. Footnotes 1, 2, and 6 are also probably completely irrelevant to your scenario.
For the second match, we see that not only is $<rex-var> reported as being Nil during the debug call, the final match variable, as reported back in the mainline with the second say $/, is also missing the rex-var match. And the only difference is that the regex $regex is called from within an outer regex.

What's the error with this BASH regex script?

I'm trying to make a program that reads in n strings and checks whether each one matches the pattern XXXXX1234X, where each X is an uppercase letter and each of 1, 2, 3, 4 stands for any digit. As far as I can tell, the regex pattern is correct. The problem seems to be in the input and comparison of strings.
read n
i=0
declare -a str
while [ $i -lt $n ]
do
read 'str[$i]'
i=$((i+1))
done
i=0
while [ $i -lt $n ]
do
[[ $(str[$i]) =~ ^([A-Z]){5}([0-9]){4}([A-Z]){1}$ ]] && echo YES || echo NO
i=$((i+1))
done
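I made a minor modification to your code: in the regex test I replaced the ( and ) around str[$i] with { and }. $(...) is command substitution, so it tries to run str[$i] as a command, whereas ${...} expands the array element: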
[[ ${str[$i]} =~ ^...
I ran some tests and it worked:
#!/bin/bash
read n
i=0
declare -a str
while [ $i -lt $n ]
do
read 'str[$i]'
i=$((i+1))
done
i=0
while [ $i -lt $n ]
do
[[ ${str[$i]} =~ ^([A-Z]){5}([0-9]){4}([A-Z]){1}$ ]] && echo YES || echo NO
i=$((i+1))
done
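To see the fix in isolation, here is a small sketch of the same check wrapped in a function. The name is_code is made up for illustration, and the POSIX classes [[:upper:]] and [[:digit:]] are swapped in for [A-Z] and [0-9] because bracket ranges can behave differently depending on the locale.

```shell
#!/usr/bin/env bash
# is_code STRING
# Succeeds if STRING is 5 uppercase letters, 4 digits and 1 uppercase letter.
is_code() {
    # Keeping the pattern in a variable avoids quoting surprises with =~.
    local re='^[[:upper:]]{5}[[:digit:]]{4}[[:upper:]]$'
    [[ $1 =~ $re ]]
}

str=(ABCDE1234F abcde1234f ABCD12345F)
# "${str[@]}" and "${str[$i]}" expand array elements;
# $(str[$i]) would instead try to run a command named str[...].
for s in "${str[@]}"; do
    if is_code "$s"; then echo YES; else echo NO; fi
done
```

Using if/else rather than && ... || also avoids printing NO by accident should the first echo ever fail.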

How should I use exact keyword matching as a condition in the case statement?

I was trying to write myself some handy scripts in order to legitimately slack off at work more efficiently, and this question suddenly popped up:
Given a very long string $LONGEST_EVER_STRING and several keyword strings like A='foo bar', B='omg bbq' and C='stack overflow':
How should I use exact keyword matching as a condition in the case statement?
for word in $LONGEST_EVER_STRING; do
case $word in
any exact match in $A) do something ;;
any exact match in $B) do something ;;
any exact match in $C) do something ;;
*) do something else;;
esac
done
I know I can write it this way, but it looks really ugly:
for word in $LONGEST_EVER_STRING; do
if [[ -n $(echo $A | fgrep -w $word) ]]; then
do something;
elif [[ -n $(echo $B | fgrep -w $word) ]]; then
do something;
elif [[ -n $(echo $C | fgrep -w $word) ]]; then
do something;
else
do something else;
fi
done
Does anyone have an elegant solution? Many thanks!
You could use a function to transform your A, B and C variables into extglob patterns, and then:
shopt -s extglob
# @(...) matches exactly one of the alternatives; +(...) would also match e.g. "foobar"
Ax="@(foo|bar)"
Bx="@(omg|bbq)"
Cx="@(stack|overflow)"
for word in $LONGEST_EVER_STRING; do
case $word in
$Ax) do something ;;
$Bx) do something ;;
$Cx) do something ;;
*) do something else;;
esac
done
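The transform could look something like the following sketch. to_pattern and classify are hypothetical names; it assumes the keywords are separated by whitespace and contain no glob characters, and it builds an @(...) pattern (exactly one of the alternatives) so that a word like foobar does not count as a match.

```shell
#!/usr/bin/env bash
shopt -s extglob    # needed for @(a|b) patterns in case

# to_pattern "foo bar" prints @(foo|bar)
# Assumption: whitespace-separated keywords without glob characters.
to_pattern() {
    local words
    read -r -a words <<< "$1"       # split on the default IFS
    local IFS='|'                   # "${words[*]}" joins with the first IFS char
    printf '@(%s)\n' "${words[*]}"
}

A='foo bar'
B='omg bbq'
Ax=$(to_pattern "$A")
Bx=$(to_pattern "$B")

# classify WORD prints which keyword list WORD belongs to.
classify() {
    case $1 in
        $Ax) echo A ;;      # unquoted, so the value is used as a pattern
        $Bx) echo B ;;
        *)   echo other ;;
    esac
}
```

The patterns are built once, outside the loop over $LONGEST_EVER_STRING, so no external process is started per word.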
I would just define a function for this. It'll be slower than grep for large wordlists, but faster than starting up grep many times.
##
# Success if the first arg is one of the later args.
has() {
[[ $1 = "$2" ]] || {
[[ $3 ]] && has "$1" "${@:3}"
}
}
$ has a b c && echo t || echo f
f
$ has a b c a d e f && echo t || echo f
t
A variation on /etc/bashrc's "pathmunge"
for word in $LONGEST_EVER_STRING; do
found_word=false
for list in " $A " " $B " " $C "; do
if [[ $list == *" $word "* ]]; then
found_word=true
stuff with $list and $word
break
fi
done
$found_word || stuff when not found
done

Bash: need to find text within matching braces (parentheses) in text

I have some text that looks like this:
(something1)something2
However something1 and something2 might also have some parentheses inside them such as
(some(thing)1)something(2)
I want to extract something1 (including internal parentheses if there are any) to a variable. Since I can count on the text always starting with an opening parenthesis, I'm hoping to match the first parenthesis to its correct closing parenthesis and extract the text in between.
Everything I have tried so far has the potential to match the wrong closing parenthesis.
If you have Perl, then:
perl -MText::Balanced -nlE 'say [Text::Balanced::extract_bracketed( $_, "()" )]->[0]' <<EOF
(something1)something2
(some(thing)1)something(2)
(some(t()()hing)()1)()something(2)
EOF
prints:
(something1)
(some(thing)1)
(some(t()()hing)()1)
Since this is apparently impossible with regular expressions, I have resorted to picking up the characters one by one:
first=""
count=0
while test -n "$string"
do
char=${string:0:1} # Get the first character
if [[ "$char" == ")" ]]
then
count=$(( $count - 1 ))
fi
if [[ $count -gt 0 ]] # numeric comparison; > inside [[ ]] compares strings
then
first="$first$char"
fi
if [[ "$char" == "(" ]]
then
count=$(( $count + 1 ))
fi
string=${string:1} # Trim the first character
if [[ $count == 0 ]]
then
second="$string"
string=""
fi
done
You can do it with perl:
echo "(some(thing)1)something(2)" | perl -ne '$_ =~ /(\((?:\(.*\)|[^(])*\))|\w+/s; print $1;'
awk can do it:
#!/bin/awk -f
{
numLeft = numRight = 0 # reset the counters for every input line
for (i=1; i<=length; ++i) {
if (numLeft == 0 && substr($0, i, 1) == "(") {
leftPos = i
numLeft = 1
} else if (substr($0, i, 1) == "(") {
++numLeft
} else if (substr($0, i, 1) == ")") {
++numRight
}
if (numLeft && numLeft == numRight) {
print substr($0, leftPos, i-leftPos+1)
next
}
}
}
Input:
(something1)something2
(some(thing)1)something(2)
Output:
(something1)
(some(thing)1)

How to automagically create pattern based on real data?

I have many vendors in a database, and they all differ in some aspect of their data. I'd like to make a data validation rule that is based on previous data.
Example:
A: XZ-4, XZ-23, XZ-217
B: 1276, 1899, 22711
C: 12-4, 12-75, 12
Goal: if a user inputs the string 'XZ-217' for vendor B, the algorithm should compare it with the previous data and say: this string is not similar to vendor B's previous data.
Is there a good way or tool to achieve such a comparison? The answer could be a generic algorithm or a Perl module.
Edit:
"Similarity" is hard to define, I agree. But I'd like to find an algorithm that could analyze the previous 100 or so samples and then compare the outcome of that analysis with new data. Similarity may be based on length, on the use of letters/digits, on string creation patterns, on a similar beginning/end/middle, or on the presence of separators.
I feel it is not an easy task, but on the other hand I think it has very wide use, so I hoped there would already be some hints.
You may want to peruse:
http://en.wikipedia.org/wiki/String_metric and http://search.cpan.org/dist/Text-Levenshtein/Levenshtein.pm (for instance)
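To get a feel for what such a string metric measures, the Levenshtein edit distance can be sketched in a few lines of bash. This is a plain two-row dynamic-programming version, only meant for short identifiers like the ones in the question; the function name levenshtein is just chosen for illustration and is not the interface of the Perl module.

```shell
#!/usr/bin/env bash
# levenshtein A B
# Prints the minimum number of single-character insertions, deletions
# and substitutions needed to turn A into B.
levenshtein() {
    local a=$1 b=$2
    local -i la=${#a} lb=${#b} i j cost del ins sub
    local -a prev cur
    # Row 0: turning the empty prefix of A into b[0..j) takes j insertions.
    for (( j = 0; j <= lb; j++ )); do prev[j]=$j; done
    for (( i = 1; i <= la; i++ )); do
        cur=( "$i" )                 # column 0: i deletions
        for (( j = 1; j <= lb; j++ )); do
            if [[ ${a:i-1:1} == "${b:j-1:1}" ]]; then cost=0; else cost=1; fi
            del=$(( prev[j] + 1 ))
            ins=$(( cur[j-1] + 1 ))
            sub=$(( prev[j-1] + cost ))
            cur[j]=$(( del < ins ? del : ins ))
            if (( sub < cur[j] )); then cur[j]=$sub; fi
        done
        prev=( "${cur[@]}" )
    done
    echo "${prev[lb]}"
}
```

For example, levenshtein XZ-217 XZ-23 prints 2. How small a distance has to be to count as "similar" is a threshold you would have to pick from your own data.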
Joel and I came up with similar ideas. The code below differentiates 3 types of zones.
one or more non-word characters
alphanumeric cluster
a cluster of digits
It creates a profile of the string and a regex to match input. In addition, it also contains logic to expand existing profiles. At the end, in the task sub, it contains some pseudo logic which indicates how this might be integrated into a larger application.
use strict;
use warnings;
use List::Util qw<max min>;
sub compile_search_expr {
shift;
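# accept either a list of profiles or a single reference to such a list
@_ = @{ shift() } if @_ == 1;
my $str
= join( '|'
, map { join( ''
, grep { defined; }
map {
# each chunk is [ type, ... ]: test the type in element 0
$_->[0] eq 'P' ? quotemeta( $_->[1] )
: $_->[0] eq 'W' ? "\\w{$_->[1],$_->[2]}"
: $_->[0] eq 'D' ? "\\d{$_->[1],$_->[2]}"
: undef
;
} @$_
)
} @_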
);
return qr/^(?:$str)$/;
}
sub merge_profiles {
shift;
my ( $profile_list, $new_profile ) = @_;
my $found = 0;
PROFILE:
for my $profile ( @$profile_list ) {
my $profile_length = @$profile;
# it's not the same profile.
next PROFILE unless $profile_length == @$new_profile;
my @merged;
for ( my $i = 0; $i < $profile_length; $i++ ) {
my $old = $profile->[$i];
my $new = $new_profile->[$i];
next PROFILE unless $old->[0] eq $new->[0];
push( @merged
, [ $old->[0]
, min( $old->[1], $new->[1] )
, max( $old->[2], $new->[2] )
]);
}
@$profile = @merged;
$found = 1;
last PROFILE;
}
push @$profile_list, $new_profile unless $found;
return;
}
sub compute_info_profile {
shift;
my #profile_chunks
= map {
/\W/ ? [ P => $_ ]
: /\D/ ? [ W => length, length ]
: [ D => length, length ]
}
grep { length; } split /(\W+)/, shift
;
}
# Pseudo-Perl
sub process_input_task {
my ( $application, $input ) = @_;
my $patterns = $application->get_patterns_for_current_customer;
my $regex = $application->compile_search_expr( $patterns );
if ( $input =~ /$regex/ ) {}
elsif ( $application->approve_divergeance( $input )) {
$application->merge_profiles( $patterns, compute_info_profile( $input ));
}
else {
$application->escalate(
Incident->new( issue => INVALID_FORMAT
, input => $input
, customer => $customer
));
}
return $application->process_approved_input( $input );
}
Here is my implementation and a loop over your test cases. Basically you give a list of good values to the function and it tries to build a regex for it.
output:
A: (?^:\w{2,2}(?:\-){1}\d{1,3})
B: (?^:\d{4,5})
C: (?^:\d{2,2}(?:\-)?\d{0,2})
code:
#!/usr/bin/env perl
use strict;
use warnings;
use List::MoreUtils qw'uniq each_arrayref';
my %examples = (
A => [qw/ XZ-4 XZ-23 XZ-217 /],
B => [qw/ 1276 1899 22711 /],
C => [qw/ 12-4 12-75 12 /],
);
foreach my $example (sort keys %examples) {
print "$example: ", gen_regex(@{ $examples{$example} }) || "Generate failed!", "\n";
}
sub gen_regex {
my @cases = @_;
my %exploded;
# ex. $case may be XZ-217
foreach my $case (@cases) {
my @parts =
grep { defined and length }
split( /(\d+|\w+)/, $case );
# @parts are ( XZ, -, 217 )
foreach (@parts) {
if (/\d/) {
# 217 becomes ['\d' => 3]
push @{ $exploded{$case} }, ['\d' => length];
} elsif (/\w/) {
# XZ becomes ['\w' => 2]
push @{ $exploded{$case} }, ['\w' => length];
} else {
# - becomes ['lit' => '-']
push @{ $exploded{$case} }, ['lit' => $_ ];
}
}
}
my $pattern = '';
# iterate over nth element (part) of each case
my $ea = each_arrayref(values %exploded);
while (my @parts = $ea->()) {
# remove undefined (i.e. optional) parts
my @def_parts = grep { defined } @parts;
# check that all (defined) parts are the same type
my @part_types = uniq map {$_->[0]} @def_parts;
if (@part_types > 1) {
warn "Parts not aligned\n";
return;
}
my $type = $part_types[0]; #same so make scalar
# were there optional parts?
my $required = (@parts == @def_parts);
# keep the values of each part
# these are either a repetition count or lit strings
my @values = uniq map { $_->[1] } @def_parts;
# these are for non-literal quantifiers; sort the counts numerically
# (writing "sort uniq LIST" would make sort use uniq as its comparison sub)
my @counts = $type eq 'lit' ? () : sort { $a <=> $b } @values;
my $min = $required ? $counts[0] : 0;
my $max = $counts[-1];
# write the specific pattern for each type
if ($type eq '\d') {
$pattern .= '\d' . "{$min,$max}";
} elsif ($type eq '\w') {
$pattern .= '\w' . "{$min,$max}";
} elsif ($type eq 'lit') {
# quote special characters, - becomes \-
my @uniq = map { quotemeta } uniq @values;
# join with alternations, surround by non-capture group, add quantifier
$pattern .= '(?:' . join('|', @uniq) . ')' . ($required ? '{1}' : '?');
}
}
# build the qr regex from pattern
my $regex = qr/$pattern/;
# test that all original patterns match (@fail should be empty)
my @fail = grep { $_ !~ $regex } @cases;
if (@fail) {
warn "Some cases fail for generated pattern $regex: (@fail)\n";
return '';
} else {
return $regex;
}
}
To simplify the work of finding the pattern, optional parts may come at the end, but no required parts may come after optional ones. This could probably be overcome but it might be hard.
If there was a Tie::StringApproxHash module, it would fit the bill here.
I think you're looking for something that combines the fuzzy-logic functionality of String::Approx and the hash interface of Tie::RegexpHash.
The former is more important; the latter would make light work of coding.