I want to use a Perl regex to remove the outer brackets of a function, but I can't construct a regex that doesn't interfere with the inner brackets. Here is an example:
void init(){
if(true){
//do something
}
}
into
void init()
if(true){
//do something
}
Is there a regex that can do this?
Write a parser for the language. Here's a simplified example using Marpa::R2:
#!/usr/bin/perl
use warnings;
use strict;
use Marpa::R2;
my $input = << '__IN__';
void init(){
if(true){
//do something
}
}
__IN__
my $dsl = << '__DSL__';
:default ::= action => concat
lexeme default = latm => 1
FuncDef ::= type name Arglist ('{') Body ('}')
Arglist ::= '(' Args ')'
Args ::= Arg* separator => comma
Arg ::= type name
Body ::= Block+
Block ::= nonbrace
| '{' nonbrace '}'
nonbrace ~ [^{}]*
comma ~ ','
type ~ 'void'
name ~ [\w]+
space ~ [\s]+
:discard ~ space
__DSL__
sub concat { shift; join ' ', @_ }
my $grammar = 'Marpa::R2::Scanless::G'->new({ source => \$dsl });
my $value = $grammar->parse(\$input, { semantics_package => 'main' });
print $$value;
The curly brackets at FuncDef are parenthesized, which tells Marpa to discard them.
Here it is:
my $s = "void init(){ if(true){ //do something }}";
$s =~ s/^([^{]+)\{(.*)\}([^{]*)$/$1$2$3/s;
print "$s\n";
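The substitution above pairs the first `{` with the last `}` in the whole string, so it only fits input that holds a single function. As a hedged sketch (the helper name `strip_outer_braces` is made up for illustration), a recursive pattern can strip the outer pair of every top-level balanced block instead. It assumes braces never appear inside comments or string literals:

```perl
use strict;
use warnings;

# Hypothetical helper: removes the outermost brace pair of every
# top-level balanced {...} block and leaves nested braces untouched.
sub strip_outer_braces {
    my ($src) = @_;
    # (?R) recurses into the whole pattern, so nested blocks are consumed
    # as part of the outer match and are never visited again by /g.
    $src =~ s/\{((?:[^{}]++|(?R))*)\}/$1/g;
    return $src;
}

print strip_outer_braces("void init(){\nif(true){\n//do something\n}\n}\n");
```

For real source code, where braces can hide in comments and strings, the parser approach is still the safer route.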
Problem in parsing/identifying double quoted string from the big expression.
use strict;
use Marpa::R2;
use Data::Dumper;
my $grammar = Marpa::R2::Scanless::G->new({
default_action => '[values]',
source => \(<<'END_OF_SOURCE'),
:start ::= expression
expression ::= expression OP expression
expression ::= expression COMMA expression
expression ::= func LPAREN PARAM RPAREN
expression ::= PARAM
PARAM ::= STRING | REGEX_STRING
:discard ~ sp
sp ~ [\s]+
COMMA ~ [,]
STRING ~ [^ \/\(\),&:\"~]+
REGEX_STRING ~ yet to identify
OP ~ ' - ' | '&'
LPAREN ~ '('
RPAREN ~ ')'
func ~ 'func'
END_OF_SOURCE
});
my $recce = Marpa::R2::Scanless::R->new({grammar => $grammar});
my $input1 = "func(foo)&func(bar)";  # parses properly: foo and bar become STRING lexemes
my $input2 = "\"foo\"";  # here foo should be a REGEX_STRING lexeme, i.e. something enclosed in double quotes
my $input3 = "func(\"foo\") - func(\"bar\")";  # func should be the func lexeme, ( LPAREN, ) RPAREN, foo a REGEX_STRING, - an OP, and likewise for func(\"bar\")
my $input4 = "func(\"foo\")";  # func should be the func lexeme, ( LPAREN, ) RPAREN, foo a REGEX_STRING
my $input = $input1;  # choose which of the inputs above to parse
print "Trying to parse:\n$input\n\n";
$recce->read(\$input);
my $value_ref = ${$recce->value};
print "Output:\n".Dumper($value_ref);
What I tried:
1st method:
My REGEX_STRING should be something like: REGEX_STRING ~ '\"([^:]*?)\"'
If I put the above REGEX_STRING in the code, with my $input4 = "func(\"foo\")"; as the input expression, I get an error like:
Error in SLIF parse: No lexeme found at line 1, column 5
* String before error: func(
* The error was at line 1, column 5, and at character 0x0022 '"', ...
* here: "foo")
Marpa::R2 exception
2nd method:
I tried including rules like:
PARAM ::= STRING | REGEX_STRING
REGEX_STRING ::= '"' QUOTED_STRING '"'
STRING ~ [^ \/\(\),&:\"~]+
QUOTED_STRING ~ [^ ,&:\"~]+
The problem here is that the input is given as:
my $input4 = "func(\"foo\")";
So here it gives an error, because there are now two ways to parse this expression: either the whole thing between double quotes, func(\"foo\"),
is taken as QUOTED_STRING, or func is taken as the func lexeme, and so on.
Please help me fix this.
use 5.026;
use strictures;
use Data::Dumper qw(Dumper);
use Marpa::R2 qw();
my $grammar = Marpa::R2::Scanless::G->new({
bless_package => 'parsetree',
source => \<<'',
:default ::= action => [values] bless => ::lhs
lexeme default = bless => ::name latm => 1
:start ::= expression
expression ::= expression OP expression
expression ::= expression COMMA expression
expression ::= func LPAREN PARAM RPAREN
expression ::= PARAM
PARAM ::= STRING | REGEXSTRING
:discard ~ sp
sp ~ [\s]+
COMMA ~ [,]
STRING ~ [^ \/\(\),&:\"~]+
REGEXSTRING ::= '"' QUOTEDSTRING '"'
QUOTEDSTRING ~ [^ ,&:\"~]+
OP ~ ' - ' | '&'
LPAREN ~ '('
RPAREN ~ ')'
func ~ 'func'
});
# say $grammar->show_rules;
for my $input (
'func(foo)&func(bar)', '"foo"', 'func("foo") - func("bar")', 'func("foo")'
) {
my $r = Marpa::R2::Scanless::R->new({
grammar => $grammar,
# trace_terminals => 1
});
$r->read(\$input);
say "# $input";
say Dumper $r->value;
}
The 2nd method posted in the question worked for me. I just had to include:
lexeme default = latm => 1
in my code.
I am trying to match different logical expressions, such as "$a and $b", using a Perl regex. Here is my code:
if ($input =~ /^(.*)\s(and|or|==|<|>|>=|<=)\s(.*)$/) {
$arg1=$1;
$arg2=$3;
$opt=$2;
}
and my purpose is to get:
$arg1="$ARGV[0]=~/\w{4}/"
$arg2="$num_arg==1"
$opt ="and"
I want to capture whichever operator the alternation actually matched. I don't want to write a separate match for each case and hardcode the operator.
Does anyone know how to solve the problem?
This code works for me:
$input = '$ARGV[0]=~/\w{4}/ and $num_arg==1';
if ($input=~/^(.*)\s(and|or|==|<|>|>=|<=)\s(.*)$/) {
$arg1=$1;
$arg2=$3;
$opt=$2;
print "$arg1\n$arg2\n$opt\n";
}
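One behaviour worth knowing before relying on this regex: the first `(.*)` is greedy, so when the input contains several whitespace-delimited operators, the split happens at the last one. A minimal sketch, wrapping the same pattern in a hypothetical `split_logic` helper:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the regex above.
# Returns (left operand, operator, right operand), or an empty list.
sub split_logic {
    my ($input) = @_;
    return unless $input =~ /^(.*)\s(and|or|==|<|>|>=|<=)\s(.*)$/;
    return ($1, $2, $3);
}

# The greedy first group swallows "$a and $b"; the captured operator is "or".
print join(' | ', split_logic('$a and $b or $c')), "\n";
```

Changing the first group to `(.*?)` would split at the first operator instead. Neither choice recovers the nesting, which is why a grammar is suggested for anything beyond a single operator.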
You need a small parser that can reveal the structure of a logical expression, because a term may itself contain another expression. You can test your grammar in Perl using the Marpa::R2 package.
As a first attempt I would write:
<expression> ::= <term> | <expression> <binary-op> <term>
<term> ::= <factor> <binary-op> <factor> | <unary-op><factor>
<factor> ::= <id>
<binary-op> ::= (and|or|==|<|>|>=|<=)
<unary-op> ::= (not | ! )
One thing is for sure: you can't completely describe the syntax of a logical expression using only regular expressions; there will always be some valid case they miss.
The Perl Code for validation
use Modern::Perl;
use Marpa::R2;
my $dsl = <<'END_OF_DSL';
:default ::= action => [name,values]
lexeme default = latm => 1
Expression ::= Term
| Expression BinaryOP Term
Term ::= Factor BinaryOP Factor
| UnaryOP Factor
Factor ::= ID
ID ~ [\w]+
BinaryOP ~ 'and' | 'or' | '==' | '<' | '>' | '>=' | '<='
UnaryOP ~ 'not' | '!'
:discard ~ whitespace
whitespace ~ [\s]+
END_OF_DSL
# your input
my $input = 'a and b or !c';
# your parser
my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
# process input
my $recce = Marpa::R2::Scanless::R->new(
{ grammar => $grammar, semantics_package => 'My_Actions' } );
my $length_read = $recce->read( \$input );
die "Read ended after $length_read of ", length $input, " characters"
if $length_read != length $input;
Let's consider this piece of code which does not belong to any known language:
foo() {
bar() {
bar();
}
baz() {
// Content baz
qux() {
// Content qux
}
}
}
I would like to iteratively process each function by calling a subroutine which receives: the function name, the arguments, the indentation level and the content.
So far I have written this:
#!/usr/bin/env perl
use 5.010;
$_ = do {local $/; <>};
s/([{}])/$1.($1 eq '{'?++$i:$i--).$1/eg;
parse($_);
sub parse {
local $_ = shift;
while (/(?<name>\w+)\s*\((?<args>.*?)\)\s*\{(\d)\{(?<content>.*?)\}(?<level>\3)\}/gs) {
parse($+{content});
process($+{name}, $+{args}, $+{level}, $+{content});
}
}
sub process {
my ($name, $args, $level, $content) = @_;
#...
}
The tricky idea is to replace in-place each matched brace { with an indentation number. So this:
{
{
}
}
will become this:
{1{
{2{
}2}
}1}
This makes the parsing regex easy to write; it simply becomes:
qr/
\w+ # name
\s* \(.*?\) # arguments
\s* \{(\d)\{ # opening brace
.*? # content
\s* \}\1\} # closing brace
/x
How can I rewrite this without this trick?
Note that the choice of {1{ could be anything else like {(1), {1-, {[1] or even {1✈
You could try using recursive regular expressions. For example:
/(\{(?:[^{}]++|(?1))*\})/
will match a group of balanced braces. For more information, see
Can I use Perl regular expressions to match balanced text? in perlfaq6,
Perl documentation "Extended patterns" in perlre, and
extract_bracketed in Text::Balanced.
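As a hedged sketch of how the recursive pattern could replace the brace-numbering trick: the regex matches one definition with a balanced body, and an ordinary Perl recursion descends into the body, so the depth is tracked by the call stack. The names `walk` and `$visit` are made up for illustration, and the sketch assumes braces never appear inside comments or strings:

```perl
use strict;
use warnings;

# One function definition: name, argument list, and a balanced {...} body.
# (?&body) recurses so that nested braces stay inside the same match.
my $func = qr/
    (?<name> \w+ ) \s*
    \( (?<args> [^()]* ) \) \s*
    (?<body> \{ (?<inner> (?: [^{}]++ | (?&body) )* ) \} )
/x;

# Hypothetical walker: calls $visit->(name, args, depth, content) for each
# definition found in $text, then descends into that definition's content.
sub walk {
    my ($text, $depth, $visit) = @_;
    while ($text =~ /$func/g) {
        # Copy the captures before recursing, since %+ is overwritten.
        my ($name, $args, $inner) = ($+{name}, $+{args}, $+{inner});
        $visit->($name, $args, $depth, $inner);
        walk($inner, $depth + 1, $visit);
    }
}

my $source = <<'END';
foo() {
    bar() {
        bar();
    }
    baz() {
        // Content baz
        qux() {
            // Content qux
        }
    }
}
END

walk($source, 0, sub {
    my ($name, $args, $depth, $content) = @_;
    print '  ' x $depth, "$name($args)\n";
});
```

Here the depth starts at 0 for the outermost definition, and each function is visited before the functions nested inside it.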
You will have a hard time doing this with a recursive regex because you can only get the last value of each capturing group and will lose all of the intervening values. This task is more suited to a parser like Parse::RecDescent:
use strict;
use warnings;
use 5.010;
use Parse::RecDescent;
sub process {
my ($name, $args, $depth) = @_;
say "$depth - $name($args)";
}
my $grammar = q{
{ my $indent = 0; }
startrule : expression(s /;/)
expression : function_call
| function_def[$indent]
function_call : identifier '(' arglist ')' ';'
function_def : identifier '(' arglist ')' '{' expression[ $arg[0]++ ](s?) '}'
{ main::process( $item{identifier}, join(',', @{ $item{arglist} }), $arg[0] ) }
arglist : identifier(s? /,/)
identifier : /\w+/
};
# Tell parser to ignore spaces and C99 one-line comments
my $skip_spaces_and_comments = qr{ (?: \s+ | // .*?$ )* }mxs;
$Parse::RecDescent::skip = $skip_spaces_and_comments;
my $parser = Parse::RecDescent->new($grammar) or die 'Bad grammar';
my $text = do { local $/; <DATA> };
defined $parser->startrule($text) or die 'Failed to parse';
__DATA__
foo() {
bar() {
bar();
}
baz() {
// Content baz
qux() {
// Content qux
}
}
}
Output:
2 - bar()
1 - qux()
2 - baz()
3 - foo()
Note that the depths are inverted (1 is the most nested) and this doesn't return the contents of the function definitions, but it should get you started.
I'm trying to munge a simple grammar with a perl regex (note this isn’t intended for production use, just a quick analysis for providing editor hints/completions). For instance,
my $GRAMMAR = qr{(?(DEFINE)
(?<expr> \( (?&expr) \) | (?&number) | (?&var) | (?&expr) (?&op) (?&expr) )
(?<number> \d++ )
(?<var> [a-z]++ )
(?<op> [-+*/] )
)}x;
I would like to be able to run this as
$expr =~ /$GRAMMAR(?&expr)/;
and then access all the variable names. However, according to perlre,
Note that capture groups matched inside of recursion are not accessible after the recursion returns, so the extra layer of capturing groups is necessary. Thus $+{NAME_PAT} would not be defined even though $+{NAME} would be.
So apparently this is not possible. I could try using a (?{ code }) block to save variable names to a hash, but this doesn't respect backtracking (i.e. the assignment’s side effect persists even if the variable is backtracked past).
Is there any way to get everything captured by a given named capture group, including recursive matches? Or do I need to manually dig through the individual pieces (and thus duplicate all the patterns)?
The necessity of having to add capturing and backtracking machinery is one of the shortcomings that Regexp::Grammars addresses.
However, the grammar in your question is left-recursive, which neither Perl regexes nor a recursive-descent parser will parse.
Adapting your grammar to Regexp::Grammars and factoring out left-recursion produces
my $EXPR = do {
use Regexp::Grammars;
qr{
^ <Expr> $
<rule: Expr> <Term> <ExprTail>
| <Term>
<rule: Term> <Number>
| <Var>
| \( <MATCH=Expr> \)
<rule: ExprTail> <Op> <Expr>
<token: Op> \+ | \- | \* | \/
<token: Number> \d++
<token: Var> [a-z]++
}x;
};
Note that this simple grammar gives all operators equal precedence rather than Please Excuse My Dear Aunt Sally.
You want to extract all variable names, so you could walk the AST as in
sub all_variables {
my($root,$var) = @_;
$var ||= {};
++$var->{ $root->{Var} } if exists $root->{Var};
all_variables($_, $var) for grep ref $_, values %$root;
wantarray ? keys %$var : [ keys %$var ];
}
and print the result with
if ("(a + (b - c))" =~ $EXPR) {
print "[$_]\n" for sort +all_variables \%/;
}
else {
print "no match\n";
}
Another approach is to install an autoaction for the Var rule that records names of variables as they are successfully parsed.
package JustTheVarsMaam;
sub new { bless {}, shift }
sub Var {
my($self,$result) = @_;
++$self->{VARS}{$result};
$result;
}
sub all_variables { keys %{ $_[0]->{VARS} } }
1;
Call this one as in
my $vars = JustTheVarsMaam->new;
if ("(a + (b - c))" =~ $EXPR->with_actions($vars)) {
print "[$_]\n" for sort $vars->all_variables;
}
else {
print "no match\n";
}
Either way, the output is
[a]
[b]
[c]
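If installing Regexp::Grammars is not an option, a rough core-Perl alternative is to factor out the left recursion in the original (?(DEFINE)...) grammar, use it only to validate the input, and then collect the names with a second, flat match. This is a sketch under one loud assumption: in this toy grammar every run of lowercase letters is a variable, so the flat match cannot pick up anything else. The helper name `variables` is made up for illustration:

```perl
use strict;
use warnings;

# The question's grammar with left recursion factored out:
# an expression is a term followed by any number of "op term" pairs.
my $GRAMMAR = qr{(?(DEFINE)
    (?<expr>   (?&term) (?: \s* (?&op) \s* (?&term) )* )
    (?<term>   \( \s* (?&expr) \s* \) | (?&number) | (?&var) )
    (?<number> \d++ )
    (?<var>    [a-z]++ )
    (?<op>     [-+*/] )
)}x;

# Hypothetical helper: returns the distinct variable names in order of
# first appearance, or an empty list if the input is not a valid expression.
sub variables {
    my ($expr) = @_;
    return unless $expr =~ /\A \s* (?&expr) \s* \z $GRAMMAR/x;
    my %seen;
    # Assumption: after validation, any lowercase run is a variable name.
    return grep { !$seen{$_}++ } $expr =~ /([a-z]+)/g;
}

print "[$_]\n" for variables('(a + (b - c))');
```

This sidesteps the lost-captures problem rather than solving it, so it stops working as soon as the grammar gains other lowercase tokens such as keywords or function names.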
Recursivity is native with Marpa::R2 using the BNF in the __DATA__ section below:
#!/usr/bin/env perl
use strict;
use diagnostics;
use Marpa::R2;
my $input = shift || '(a + (b - c))';
my $grammar_source = do {local $/; <DATA>};
my $recognizer = Marpa::R2::Scanless::R->new
(
{
grammar => Marpa::R2::Scanless::G->new
(
{
source => \$grammar_source,
action_object => __PACKAGE__,
}
)
},
);
my %vars = ();
sub new { return bless {}, shift;}
sub varAction { ++$vars{$_[1]}};
$recognizer->read(\$input);
$recognizer->value() || die "No parse";
print join(', ', sort keys %vars) . "\n";
__DATA__
:start ::= expr
expr ::= NUMBER
| VAR action => varAction
| expr OP expr
| '(' expr ')'
NUMBER ~ [\d]+
VAR ~ [a-z]+
OP ~ [-+*/]
WS ~ [\s]+
:discard ~ WS
The output is:
a, b, c
Your question addressed only how to get the variable names, so this answer leaves out operator associativity and the like. Just note that Marpa has no problem with that, if needed.
I have a string "CamelCase", I use this RegEx :
string pattern = "(?<!(^|[A-Z]))(?=[A-Z])|(?<!^)(?=[A-Z][a-z])";
string[] substrings = Regex.Split("CamelCase", pattern);
In substrings I get Camel and Case, which is fine, but I'd like them all in uppercase, like CAMEL and CASE. Better still, I'd like to get a single string like CAMEL_CASE, but please, all with Regex.
Here is a JavaScript implementation.
function camelCaseToUpperCase(str) {
return str.replace(/([a-z])([A-Z])/g, '$1_$2').toUpperCase();
}
Demo
printList([ 'CamelCase', 'camelCase' ],
function(value, idx, values) {
return value + ' -> '
+ camelCaseToUpperCase(value) + ' -> '
+ camelToTitle(value, '_');
}
);
// Case Conversion Functions
function camelCaseToUpperCase(str) {
return str.replace(/([a-z])([A-Z])/g, '$1_$2').toUpperCase();
}
function camelToTitle(str, delimiter) {
return str.replace(/([A-Z][a-z]+)/g, ' $1') // Words beginning with UC
.replace(/([A-Z][A-Z]+)/g, ' $1') // "Words" of only UC
.replace(/([^A-Za-z ]+)/g, ' $1') // "Words" of non-letters
.trim() // Remove any leading/trailing spaces
.replace(/[ ]/g, delimiter || ' '); // Replace all spaces with the delim
}
// Utility Functions
function printList(items, conversionFn) {
var str = '<ul>';
[].forEach.call(items, function(item, index) {
str += '<li>' + conversionFn(item, index, items) + '</li>';
});
print(str + '</ul>');
}
function print() {
write.apply(undefined, arguments);
}
function println() {
write.apply(undefined, [].splice.call(arguments,0).concat('<br />'));
}
function write() {
document.getElementById('output').innerHTML += arguments.length > 1 ?
[].join.call(arguments, ' ') : arguments[0]
}
#output {
font-family: monospace;
}
<h1>Case Conversion Demo</h1>
<div id="output"></div>
In Perl you can do this:
$string = "CamelCase";
$string =~ s/((?<=[a-z])[A-Z][a-z]+)/_\U$1/g;
$string =~ s/(\b[A-Z][a-z]+)/\U$1/g;
print "$string\n";
The replacement uses \U to convert the found group to uppercase.
That can be compressed into a single regex using Perl's e option to evaluate a replacement:
$string = "CamelCase";
$string =~ s/(?:\b|(?<=([a-z])))([A-Z][a-z]+)/(defined($1) ? "_" : "") . uc($2)/eg;
print "$string\n";
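Both Perl versions above appear to assume that every word is one capital followed by lowercase letters, so a trailing acronym such as the "ID" in "CamelCaseID" is not split off. A hedged sketch that inserts the underscores at case boundaries first and uppercases afterwards (the sub name `camel_to_upper_snake` is made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: CamelCase -> CAMEL_CASE, keeping acronyms together.
sub camel_to_upper_snake {
    my ($s) = @_;
    # Split an acronym from a following capitalized word: HTTPServer -> HTTP_Server
    $s =~ s/([A-Z]+)([A-Z][a-z])/${1}_$2/g;
    # Split a lowercase letter or digit from a following capital: lC -> l_C
    $s =~ s/([a-z\d])([A-Z])/${1}_$2/g;
    return uc $s;
}

print camel_to_upper_snake($_), "\n" for qw(CamelCase CamelCaseID HTTPServer);
```

The `${1}_` form is used so that the underscore cannot be read as part of the variable name.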
Using sed and tr unix utilities (from your terminal)...
echo "CamelCase" | sed -e 's/\([A-Z]\)/-\1/g' -e 's/^-//' | tr '-' '_' | tr '[:lower:]' '[:upper:]'
If you have camel case strings with "ID" at the end and you'd like to keep it that way, then use this one...
echo "CamelCaseID" | sed -e 's/\([A-Z]\)/-\1/g' -e 's/^-//' | tr '-' '_' | tr '[:lower:]' '[:upper:]' | sed -e 's/I_D$/ID/g'
By extending the String class in ruby...
class String
def camelcase_to_underscore
self.gsub(/::/, '/').
gsub(/([A-Z]+)([A-Z][a-z])/,'\1_\2').
gsub(/([a-z\d])([A-Z])/,'\1_\2').
tr("-", "_").
upcase
end
end
Now, you can execute the camelcase_to_underscore method on any string. Example:
>> "CamelCase".camelcase_to_underscore
=> "CAMEL_CASE"
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string input = "CamelCase";
string output = Regex.Replace(input,
@"(?:\b|(?<=([A-Za-z])))([A-Z][a-z]*)",
m => string.Format(@"{0}{1}",
(m.Groups[1].Value.Length > 0)? "_" : "", m.Groups[2].Value.ToUpper()));
Console.WriteLine(output);
}
}