Parsing Challenge - Fixing Broken Syntax - regex

I have thousands of lines of code that use a particular non-standard syntax. I need to compile the code with a different compiler that does not support this syntax. I have tried to automate the required changes, but I am not very good with regex and have failed.
Here is what I want to achieve: currently in my code an object's methods and variables are called/accessed with the following possible syntaxes:
call obj.method()
obj.method( )
obj.method( arg1, arg2, kwarg1=kwarg1 )
obj1.var = obj2.var2
Instead I want this to be:
call obj%method()
obj%method( )
obj%method( arg1, arg2, kwarg1=kwarg1 )
obj1%var = obj2%var2
And I want to make these changes without affecting the following possible occurrences of "."s:
Decimal numbers:
a = 1.0
b = 1.d0
Logical operators (note possible spaces and method calls):
if (a.or.b) then
if ( a .and. .not.(obj.l1(1.d0)) ) then
Anything that is commented (the exclamation point "!" is used for this purpose)
!>I am a commented line.
! > I am.a commented line with..leading blanks and extra periods.1.
b=a1.var( 0.d0 ) !! I contain a commented version of this line: b=a1.var( 0.d0 )
Anything that is in quotes (i.e. string literals)
c = "I am a string"
c= 'I am an obnoxious string: b=a1.var( 0.d0 ) ... '
Does anyone know how to approach this? I guess regex is the natural approach, but I am open to anything. (In case anyone cares: the code is written in Fortran. ifort is happy with the "." syntax; gfortran isn't.)

Have you looked into solving the problem with flex? It uses regular expressions, but is more advanced, as it tries different patterns and returns the longest matching option. The rules could look like this:
%% /* rule part of the program */
!.*\n ECHO; /* copy comments unchanged */
\"[^"\n]*\"|'[^'\n]*' ECHO; /* copy strings unchanged */
[^A-Za-z_][0-9]+\. ECHO; /* copy numbers unchanged */
".and."|".or."|".not." ECHO; /* copy logical operators unchanged */
\. printf("%%"); /* now, replace the . by % */
[^\.] ECHO; /* copy everything else unchanged */
%% /* invoke the program */
int main() {
yylex();
}
You may have to modify the third line. Currently it ignores any . that occurs after any number of digits, if there is none of the characters from A to Z, from a to z or the character _ before the digits. If there are more legal characters in identifiers, you can add them.
If everything is correct, you should be able to turn that into a program. Copy it into a file called lex.l and execute:
$ flex -o lex.yy.c lex.l
$ gcc -o lex.out lex.yy.c -lfl
Then you have the C program lex.out. You can just use that in the command line:
cat unreplaced.txt | ./lex.out > replaced.txt
This uses the same principle as Ed Morton's suggestion, but because flex does the scanning for us we can skip the manual bookkeeping. It still fails in some cases, such as \" inside strings.
Sample input
call obj.method()
obj.method( )
obj.method( arg1, arg2, kwarg1=kwarg1 )
obj1.var = obj2.var2
a = 1.0
b = 1.d0
if (a.or.b) then
if ( a .and. .not.(obj.l1(1.d0)) ) then
!>I am a commented line.
! > I am.a commented line with..leading blanks and extra periods.1.
b=a1.var( 0.d0 ) !! I contain a commented version of this line: b=a1.var( 0.d0 )
c = "I am a string"
c= 'I am an obnoxious string: b=a1.var( 0.d0 ) ... '
c="I am an exclaimed string!"; b=a1.var()
Output
call obj%method()
obj%method( )
obj%method( arg1, arg2, kwarg1=kwarg1 )
obj1%var = obj2%var2
a = 1.0
b = 1.d0
if (a.or.b) then
if ( a .and. .not.(obj%l1(1.d0)) ) then
!>I am a commented line.
! > I am.a commented line with..leading blanks and extra periods.1.
b=a1%var( 0.d0 ) !! I contain a commented version of this line: b=a1.var( 0.d0 )
c = "I am a string"
c= 'I am an obnoxious string: b=a1.var( 0.d0 ) ... '
c="I am an exclaimed string!"; b=a1%var()

You can't do this 100% robustly without a language parser (e.g. the following will fail in some cases if you have \" inside double quoted strings - easily handled but just one of many possible failures not covered by your use cases) but this will handle what you've shown us so far and a bit more. It uses GNU awk for gensub() and the 3rd arg to match().
Sample Input:
$ cat file
call obj.method()
obj.method( )
obj.method( arg1, arg2, kwarg1=kwarg1 )
obj1.var = obj2.var2
a = 1.0
b = 1.d0
if (a.or.b) then
if ( a .and. .not.(obj.l1(1.d0)) ) then
!>I am a commented line.
! > I am.a commented line with..leading blanks and extra periods.1.
b=a1.var( 0.d0 ) !! I contain a commented version of this line: b=a1.var( 0.d0 )
c = "I am a string"
c= 'I am an obnoxious string: b=a1.var( 0.d0 ) ... '
c="I am an exclaimed string!"; b=a1.var()
Expected Output:
$ cat out
call obj%method()
obj%method( )
obj%method( arg1, arg2, kwarg1=kwarg1 )
obj1%var = obj2%var2
a = 1.0
b = 1.d0
if (a.or.b) then
if ( a .and. .not.(obj%l1(1.d0)) ) then
!>I am a commented line.
! > I am.a commented line with..leading blanks and extra periods.1.
b=a1%var( 0.d0 ) !! I contain a commented version of this line: b=a1.var( 0.d0 )
c = "I am a string"
c= 'I am an obnoxious string: b=a1.var( 0.d0 ) ... '
c="I am an exclaimed string!"; b=a1%var()
The Script:
$ cat tst.awk
{
# give us the ability to use #<any other char> strings as a
# replacement/placeholder strings that cannot exist in the input.
gsub(/#/,"#=")
# ignore all !s inside double-quoted strings
while ( match($0,/("[^"]*)!([^"]*")/,a) ) {
$0 = substr($0,1,RSTART-1) a[1] "#-" a[2] substr($0,RSTART+RLENGTH)
}
# ignore all !s inside single-quoted strings
while ( match($0,/('[^']*)!([^']*')/,a) ) {
$0 = substr($0,1,RSTART-1) a[1] "#-" a[2] substr($0,RSTART+RLENGTH)
}
# Now we can separate comments from what comes before them
comment = gensub(/[^!]*/,"",1)
$0 = gensub(/!.*/,"",1)
# ignore all .s inside double-quoted strings
while ( match($0,/("[^"]*)\.([^"]*")/,a) ) {
$0 = substr($0,1,RSTART-1) a[1] "##" a[2] substr($0,RSTART+RLENGTH)
}
# ignore all .s inside single-quoted strings
while ( match($0,/('[^']*)\.([^']*')/,a) ) {
$0 = substr($0,1,RSTART-1) a[1] "##" a[2] substr($0,RSTART+RLENGTH)
}
# convert all logical operators like a.or.b to a##or##b so the .s won't get replaced later
while ( match($0,/\.([[:alpha:]]+)\./,a) ) {
$0 = substr($0,1,RSTART-1) "##" a[1] "##" substr($0,RSTART+RLENGTH)
}
# convert all obj.var and similar to obj%var, etc.
while ( match($0,/\<([[:alpha:]]+[[:alnum:]_]*)[.]([[:alpha:]]+[[:alnum:]_]*)\>/,a) ) {
$0 = substr($0,1,RSTART-1) a[1] "%" a[2] substr($0,RSTART+RLENGTH)
}
# Convert all ##s in the precomment text back to .s
gsub(/##/,".")
# Add the comment back
$0 = $0 comment
# Convert all #-s back to !s
gsub(/#-/,"!")
# Convert all #=s back to #s
gsub(/#=/,"#")
print
}
Running The Script And Its Output:
$ awk -f tst.awk file
call obj%method()
obj%method( )
obj%method( arg1, arg2, kwarg1=kwarg1 )
obj1%var = obj2%var2
a = 1.0
b = 1.d0
if (a.or.b) then
if ( a .and. .not.(obj%l1(1.d0)) ) then
!>I am a commented line.
! > I am.a commented line with..leading blanks and extra periods.1.
b=a1%var( 0.d0 ) !! I contain a commented version of this line: b=a1.var( 0.d0 )
c = "I am a string"
c= 'I am an obnoxious string: b=a1.var( 0.d0 ) ... '
c="I am an exclaimed string!"; b=a1%var()
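For comparison, the same "skip what must be protected, then replace what is left" idea can be sketched in Python with a single alternation. This is only a sketch under stated assumptions, not a Fortran parser: the names `TOKEN` and `dots_to_percent` are made up for illustration, the operator list is a guess at what the code base uses, and continuation lines, Hollerith constants and user-defined operators are not handled.

```python
import re

# Alternatives are tried in order at each position; everything in "skip"
# is copied through untouched, and only a "dot" match is rewritten.
TOKEN = re.compile(
    r"""
    (?P<skip>
        !.*                                    # comment: runs to end of line
      | "[^"\n]*" | '[^'\n]*'                  # string literal
      | \.(?:and|or|not|eqv|neqv|eq|ne|lt|le|gt|ge|true|false)\.
      | (?<![\w.])\d+\.\d*(?:[de][+-]?\d+)?    # real literal such as 1.0 or 1.d0
    )
  | (?P<dot>(?<=[\w)])\.(?=[a-z_]))            # member access: name.name
    """,
    re.VERBOSE | re.IGNORECASE,
)


def dots_to_percent(source: str) -> str:
    """Replace member-access dots with % and leave every other dot alone."""
    return TOKEN.sub(lambda m: "%" if m.group("dot") else m.group(0), source)


sample = """call obj.method()
obj1.var = obj2.var2
b = 1.d0
if ( a .and. .not.(obj.l1(1.d0)) ) then
b=a1.var( 0.d0 ) !! comment: b=a1.var( 0.d0 )
c="I am an exclaimed string!"; b=a1.var()
"""
print(dots_to_percent(sample))
```

Because strings are matched before comments can start inside them, the "!" in the last sample line does not hide the trailing `b=a1.var()`. Restricting the operator rule to a fixed list (rather than any `.word.`) avoids misreading a chain like `a.b.c` as an operator, at the cost of having to extend the list by hand.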

Related

Replace % in the end of string with Regex?

I want to remove " %" at the end of some text. I'd like to do it with a regular expression, because ABAP does not easily handle text at the end of a string.
DATA lv_vtext TYPE c LENGTH 10 VALUE 'TEST %'.
REPLACE REGEX ' %$' IN lv_vtext WITH ''.
But it does not replace anything. When I leave out "$" the text will be removed as expected, but I fear it might find more occurrences than wanted.
I experimented with \z or \Z instead of $, but to no avail.
This answer offers an alternative without REGEX. POSIX regular expressions are quite slow, and some people are reluctant to use them, so if you're open to doing it in plain ABAP:
lv_vtext = COND #( WHEN contains( val = lv_vtext end = ` %` )
THEN substring( val = lv_vtext len = strlen( lv_vtext ) - 2 )
ELSE lv_vtext ).
Code with context:
DATA(lv_vtext) = `test %`.
lv_vtext = COND #( WHEN contains( val = lv_vtext end = ` %` )
THEN substring( val = lv_vtext len = strlen( lv_vtext ) - 2 )
ELSE lv_vtext ).
ASSERT lv_vtext = `test`.
You can use
REPLACE REGEX '\s%\s*$' IN lv_vtext WITH ''
The benefit of using \s is that it matches any Unicode whitespace chars. The \s*$ matches any trailing (white)spaces that you might have missed.
The whole pattern matches
\s - any whitespace
% - a % char
\s* - zero or more whitespaces
$ - at the end of string.
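The behaviour of that pattern can be tried outside ABAP. The following is a small Python sketch, assuming Python's `re` module treats `\s`, `*` and `$` the same way for these inputs (it is not the ABAP regex engine, and `strip_trailing_percent` is a name invented here). The second call uses an input padded with trailing blanks to show why the optional `\s*` before `$` can matter.

```python
import re


def strip_trailing_percent(text: str) -> str:
    """Remove a whitespace-preceded % at the end, plus any blanks after it."""
    return re.sub(r"\s%\s*$", "", text)


print(repr(strip_trailing_percent("TEST %")))      # 'TEST'
print(repr(strip_trailing_percent("TEST %    ")))  # 'TEST'
print(repr(strip_trailing_percent("50 % of x")))   # '50 % of x'
```

The last call shows that a " %" in the middle of the text is left alone, which was the concern about dropping the `$` anchor.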

How to replace character at even position by ( and odd position by )

Is it possible to write down a regular expression such that the first $ sign will be replaced by a (, the second with a ), the third with a (, etc ?
For instance, the string
This is an $example$ of what I want, $ 1+1=2 $ and $ 2+2=4$.
should become
This is an (example) of what I want, ( 1+1=2 ) and ( 2+2=4).
Sort of an indirect solution, but in some languages, you can use a callback function for the replacement. You can then cycle through the options in that function. This would also work with more than two options. For example, in Python:
>>> import itertools, re
>>> text = "This is an $example$ of what I want, $ 1+1=2 $ and $ 2+2=4$."
>>> options = itertools.cycle(["(", ")"])
>>> re.sub(r"\$", lambda m: next(options), text)
'This is an (example) of what I want, ( 1+1=2 ) and ( 2+2=4).'
Or, if those always appear in pairs, as it seems to be the case in your example, you could match both $ and everything in between, and then replace the $ and reuse the stuff in between using a group reference \1; but again, not all languages support those:
>>> re.sub(r"\$(.*?)\$", r"(\1)", text)
'This is an (example) of what I want, ( 1+1=2 ) and ( 2+2=4).'
According to an answer already posted here https://stackoverflow.com/a/13947249/6332575 in Ruby you can use
yourstring.gsub("$").with_index(1){|_, i| i.odd? ? "(" : ")"}
In JavaScript:
function replace$(str) {
let first = false;
return str.replace(/\$/g, _ => (first = !first) ? '(' : ')');
}
In R, you can use str_replace, which only replaces the first match, and a while loop to deal with pairs of matches at a time.
# For str_*
library(stringr)
# For the pipes
library(magrittr)
str <- "asdfasdf $asdfa$ asdfasdf $asdf$ adsfasdf$asdf$"
while(any(str_detect(str, "\\$"))) {
str <- str %>%
str_replace("\\$", "(") %>%
str_replace("\\$", ")")
}
It's not the most efficient solution, probably, but it will go through and replace $ with ( and ) through the whole string.

In Vim, how to auto format current line when i typed semicolon in c++ files

For example: current line is
int i=0
After I type the semicolon, it should become:
int i = 0;
You'll need an inoremap on ; (and restricted to the current buffer -- :h :map-<buffer>). On the mapping, you'll have to parse the current line for (exactly?) one = sign. Reformat and be sure to move the cursor back.
The traps:
should the mapping be redoable? In that case you'll have to move the cursor with <c-g>U<left> and <right> as many times as required. Otherwise, a plain use of getline() + substitute() + setline() would work (and be quite simple actually) -> :call setline('.', substitute(getline('.'), '\s*[<>!=]\?=\s*', ' = ', 'g')).
if UTF-8 characters could appear, as in auto head = "tête";, you won't be able to use strlen().
Note that clang-format may already be able to do this kind of thing, but I cannot guarantee that it can produce redoable sequences.
" rtp/ftplugin/c/c_reformat_on_semicolon.vim
" The redoable version.
function! s:semicolon() abort
" let's suppose the ';' is at the end of the line...
" and that there aren't multiple instructions on the same line
let res = ''
let line = getline('.')
let missing_before = match(line, '\V\S=')
let c = col('.') - 1
if missing_before >= 0
" prefer lh-vim-lib lh#encoding#strlen(line[missing_before : c])
let offset = c - missing_before " +/- 1 ?
let res .= repeat("\<c-g>U\<right>", offset)
\ . ' '
if line =~ '\V=\S'
let res .= "\<c-g>U\<left> "
let offset -= 1 " or 2 ?
endif
" let offset +/-= 1 ??
else
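" assumption: in this branch only the space after '=' is missing
let missing_after = match(line, '\V=\S')
let offset = c - missing_after " +/- 1 ?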
let res .= repeat("\<c-g>U\<right>", offset)
\ . ' '
endif
let res .= repeat("\<c-g>U\<left> ", offset)
return res
endfunction
inoremap <buffer> ; <c-r>=<sid>semicolon()<cr>
Note that this code is completely untested. You'll have to adjust the offsets, and maybe fix the logic.
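The simpler, non-redoable route mentioned above (getline() + substitute() + setline()) boils down to one regex substitution on the line text. As a rough illustration of that substitution only, here is a Python sketch; `space_assignment` is a hypothetical name, and unlike the Vim one-liner it captures the operator and puts it back, so `==` or `<=` are not collapsed into a plain `=`. Compound operators such as `+=` or `<<=` are not covered.

```python
import re


def space_assignment(line: str) -> str:
    """Put exactly one space on each side of =, ==, !=, <= and >=."""
    # The captured group keeps whichever operator was matched.
    return re.sub(r"\s*([<>!=]?=)\s*", r" \1 ", line)


print(space_assignment("int i=0;"))      # int i = 0;
print(space_assignment("int i   =0;"))   # int i = 0;
print(space_assignment("if (a==b)"))     # if (a == b)
```

In a Vim mapping the result would still have to be written back with setline() and the cursor repositioned, which is the part the function above tries to make redoable.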

How to invoke a program and pass it standard input

How do I invoke a program and pass it standard input? Hypothetical example in Bash:
(echo abc; echo abba) | tr b B
Note that:
I don't have the input in a string (I'm generating it as I iterate)
I don't know how long input is
The input may span multiple lines, as in this example
I've written this in 19 other languages already. The way I usually approach it is to get a file descriptor for the program's standard input and then write to it the same way I would write to standard output.
What I've tried so far: Based on Invoke external program and pass arguments I tried passing it to echo and using the shell to handle the piping. This doesn't work if my input has single quotes in it, and it doesn't work if I don't have my input in a string (which I don't)
Here is my code, currently trying to pull it off by calculating the string that will be printed (it fails right now).
As for single quotation, how about inserting the escape character (\) before each quotation mark (')...? It seems to be working somehow for the "minimal" example:
module utils
implicit none
contains
function esc( inp ) result( ret )
character(*), intent(in) :: inp
character(:), allocatable :: ret
integer :: i
ret = ""
do i = 1, len_trim( inp )
if ( inp( i:i ) == "'" ) then
ret = ret // "\'"
else
ret = ret // inp( i:i )
endif
enddo
endfunction
endmodule
program test
use utils
implicit none
character(100) :: str1, str2
integer i
call getcwd( str1 ) !! to test a directory name containing single quotes
str2 = "ab'ba"
print *, "trim(str1) = ", trim( str1 )
print *, "trim(str2) = ", trim( str2 )
print *
print *, "esc(str1) = ", esc( str1 )
print *, "esc(str2) = ", esc( str2 )
print *
print *, "using raw str1:"
call system( "echo " // trim(str1) // " | tr b B" )
print *
print *, "using raw str2:"
call system( "echo " // trim(str2) // " | tr b B" ) !! error
print *
print *, "using esc( str1 ):"
call system( "echo " // esc(str1) // " | tr b B" )
print *
print *, "using esc( str2 ):"
call system( "echo " // esc(str2) // " | tr b B" )
print *
print *, "using esc( str1 ) and esc( str2 ):"
call system( "(echo " // esc(str1) // "; echo " // esc(str2) // ") | tr b B" )
! pbcopy (copy to pasteboard; mac-only)
! call system( "(echo " // esc(str1) // "; echo " // esc(str2) // ") | pbcopy" )
endprogram
If we run the above program in the directory /foo/baa/x\'b\'y, we obtain
trim(str1) = /foo/baa/x'b'y
trim(str2) = ab'ba
esc(str1) = /foo/baa/x\'b\'y
esc(str2) = ab\'ba
using raw str1:
/foo/baa/xBy
using raw str2:
sh: -c: line 0: unexpected EOF while looking for matching `''
sh: -c: line 1: syntax error: unexpected end of file
using esc( str1 ):
/foo/baa/x'B'y
using esc( str2 ):
aB'Ba
using esc( str1 ) and esc( str2 ):
/foo/baa/x'B'y
aB'Ba
This is what I mean by a minimal example.
call execute_command_line("(echo ""hexxo world"";echo abba)|tr x l")
end
hello world
abba
Is this not doing exactly what you ask, invoking tr and passing standard input?
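For reference, the approach described in the question (get a handle on the child's standard input and write to it while iterating, with no shell quoting involved) looks roughly like this in Python. It is a sketch for comparison only, not a Fortran solution: `pipe_lines` is a name invented here, and it assumes a Unix-like system where `tr` is on the PATH.

```python
import subprocess


def pipe_lines(command, lines):
    """Start `command`, feed it `lines` one at a time on stdin, return its stdout."""
    proc = subprocess.Popen(
        command,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    for line in lines:              # input is produced while iterating
        proc.stdin.write(line + "\n")
    proc.stdin.close()              # signals end of input to the child
    output = proc.stdout.read()
    proc.wait()
    return output


# A generator stands in for input of unknown length.
print(pipe_lines(["tr", "b", "B"], (word for word in ["abc", "abba"])), end="")
```

Since the command is passed as an argument list and the data never goes through a shell, single quotes in the input need no escaping. For large inputs, writing everything before reading can block once the pipe buffers fill, so `communicate()` or a separate reader would be the safer choice.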

How to get multiple occurrence in a regular expression when a pattern contains the same pattern within itself?

I would like to get first occurrence in a regular expression, but not embedded ones.
For example, Regular Expression is:
\bTest\b\s*\(\s*\".*\"\s*,\s*\".*\"\s*\)
Sample text is
x == Test("123" , "ABC") || x == Test ("123" , "DEF")
Result:
Test("123" , "ABC") || x == Test ("123" , "DEF")
Using any regular expression tool (Expresso, for example), I am getting the whole text as the result, as it satisfies the regular expression. Is there a way to get the result in two parts as shown below.
Test("123" , "ABC")
and
Test ("123" , "DEF")
Are you trying to parse code with regex? This is always going to be a fairly brittle solution, and you should consider using an actual parser.
That said, to solve your immediate problem, you want to use non-greedy matching - the *? quantifier instead of just the *.
Like so:
\bTest\b\s*\(\s*\".*?\"\s*,\s*\".*?\"\s*\)
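The difference between the two quantifiers can be checked quickly. Below is a small Python sketch that runs both patterns over the sample text from the question, assuming Python's `re` handles these patterns the same way as the tool in the question; the variable names are illustrative.

```python
import re

text = 'x == Test("123" , "ABC") || x == Test ("123" , "DEF")'

greedy = r'\bTest\b\s*\(\s*".*"\s*,\s*".*"\s*\)'
lazy = r'\bTest\b\s*\(\s*".*?"\s*,\s*".*?"\s*\)'

print(re.findall(greedy, text))
# ['Test("123" , "ABC") || x == Test ("123" , "DEF")']
print(re.findall(lazy, text))
# ['Test("123" , "ABC")', 'Test ("123" , "DEF")']
```

Writing `[^"]*` instead of `.*?` is a stricter alternative, since it can never run past a closing quote even when the rest of the pattern fails to match.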
A poor man's C function parser, in Perl.
## ===============================================
## C_FunctionParser_v3.pl # 3/21/09
## -------------------------------
## C/C++ Style Function Parser
## Idea - To parse out C/C++ style functions
## that have parenthetical closures (some don't).
## - sln
## ===============================================
my $VERSION = 3.0;
$|=1;
use strict;
use warnings;
# Prototype's
sub Find_Function(\$\@);
# File-scoped variables
my ($FxParse, $FName, $Preamble);
# Set function name, () gets all functions
SetFunctionName('Test'); # Test case, function 'Test'
## --------
# Source file
my $Source = join '', <DATA>;
# Extended, possibly non-compliant,
# function name - pattern examples:
# (no capture groups in function names strings or regex!)
# - - -
# SetFunctionName( qr/_T/ );
# SetFunctionName( qr/\(\s*void\s*\)\s*function/ );
# SetFunctionName( "\\(\\s*void\\s*\\)\\s*function" );
# Parse some functions
my @Funct = ();
Find_Function( $Source, @Funct );
# Print functions found
# (segments can be modified and/or collated)
if ( !@Funct ) {
print "Function name pattern: '$FName' not found!\n";
} else {
print "\nFound ".@Funct." matches.\nFunction pattern: '$FName' \n";
}
for my $ref (@Funct) {
# Format; #: Line number - function
printf "\n\#: %6d - %s\n", $$ref[3], substr($Source, $$ref[0], $$ref[2] - $$ref[0]);
}
exit;
## End
# ---------
# Set the parser's function regex pattern
#
sub SetFunctionName
{
if (!@_) {
$FName = "_*[a-zA-Z][\\w]*"; # Matches all compliant function names (default)
} else {
$FName = shift; # No capture groups in function names please
}
$Preamble = "\\s*\\(";
# Compile function parser regular expression
# Regex condensed:
# $FxParse = qr!//(?:[^\\]|\\\n?)*?\n|/\*.*?\*/|\\.|'["()]'|(")|($FName$Preamble)|(\()|(\))!s;
# | | | |1 1|2 2|3 3|4 4
# Note - Non-Captured, matching items, are meant to consume!
# -----------------------------------------------------------
# Regex /xpanded (with commentary):
$FxParse = # Regex Precedence (items MUST be in this order):
qr! # -----------------------------------------------
// # comment - //
(?: # grouping
[^\\] # any non-continuation character ^\
| # or
\\\n? # any continuation character followed by 0-1 newline \n
)*? # to be done 0-many times, stopping at the first end of comment
\n # end of comment - //
| /\*.*?\*/ # or, comment - /* + anything + */
| \\. # or, escaped char - backslash + ANY character
| '["()]' # or, single quote char - quote then one of ", (, or ), then quote
| (") # or, capture $1 - double quote as a flag
| ($FName$Preamble) # or, capture $2 - $FName + $Preamble
| (\() # or, capture $3 - ( as a flag
| (\)) # or, capture $4 - ) as a flag
!xs;
}
# Procedure that finds C/C++ style functions
# (the engine)
# Notes:
# - This is not a syntax checker !!!
# - Nested functions index and closure are cached. The search is single pass.
# - Parenthetical closures are determined via cached counter.
# - This precedence avoids all ambiguous parenthetical open/close conditions:
# 1. Dual comment styles.
# 2. Escapes.
# 3. Single quoted characters.
# 4. Double quotes, flip-flopped to determine closure.
# - Improper closures are reported, with the last one reliably being the likely culprit
# (this would be a syntax error, ie: the code won't compile, but it is reported as a closure error).
#
sub Find_Function(\$\@)
{
my ($src, $Funct) = @_;
my @Ndx = ();
my @Closure = ();
my ($Lines, $offset, $closure, $dquotes) = (1,0,0,0);
while ($$src =~ /$FxParse/xg)
{
if (defined $1) # double quote "
{
$dquotes = !$dquotes;
}
next if ($dquotes);
if (defined $2) # 'function name'
{
# ------------------------------------
# Placeholder for exclusions......
# ------------------------------------
# Cache the current function index and current closure
push @Ndx, scalar(@$Funct);
push @Closure, $closure;
my ($funcpos, $parampos) = ( $-[0], pos($$src) );
# Get newlines since last function
$Lines += substr ($$src, $offset, $funcpos - $offset) =~ tr/\n//;
# print $Lines,"\n";
# Save positions: function( parms )
push @$Funct , [$funcpos, $parampos, 0, $Lines];
# Asign new offset
$offset = $funcpos;
# Closure is now 1 because of preamble '('
$closure = 1;
}
elsif (defined $3) # '('
{
++$closure;
}
elsif (defined $4) # ')'
{
--$closure;
if ($closure <= 0)
{
$closure = 0;
if (@Ndx)
{
# Pop index and closure, store position
$$Funct[pop @Ndx][2] = pos($$src);
$closure = pop @Closure;
}
}
}
}
# To test an error, either take off the closure of a function in its source,
# or force it this way (pseudo error, make sure you have data in @$Funct):
# push @Ndx, 1;
# Its an error if index stack has elements.
# The last one reported is the likely culprit.
if (@Ndx)
{
## BAD, RETURN ...
## All elements in stack have to be fixed up
while ( @Ndx ) {
my $func_index = shift @Ndx;
my $ref = $$Funct[$func_index];
$$ref[2] = $$ref[1];
print STDERR "** Bad return, index = $func_index\n";
print "** Error! Unclosed function [$func_index], line ".
$$ref[3].": '".substr ($$src, $$ref[0], $$ref[2] - $$ref[0] )."'\n";
}
return 0;
}
return 1
}
__DATA__
x == Test("123" , "ABC") || x == Test ("123" , "DEF")
Test("123" , Test ("123" , "GHI"))?
Test("123" , "ABC(JKL)") || x == Test ("123" , "MNO")
Output (line # - function):
Found 6 matches.
Function pattern: 'Test'
#: 1 - Test("123" , "ABC")
#: 1 - Test ("123" , "DEF")
#: 2 - Test("123" , Test ("123" , "GHI"))
#: 2 - Test ("123" , "GHI")
#: 3 - Test("123" , "ABC(JKL)")
#: 3 - Test ("123" , "MNO")
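The core of that parser (find the function name plus opening parenthesis, then count parentheses until the call closes, ignoring anything inside double quotes) can be condensed into a short Python sketch. This is a simplified illustration rather than a port: `find_calls` is a name invented here, and it leaves out the comment and single-quoted character handling of the Perl version, so a function name that appears inside a comment or string would still be picked up.

```python
import re


def find_calls(source: str, name: str = "Test"):
    """Return (line_number, call_text) for each call to `name`, nested ones included."""
    results = []
    for start in re.finditer(r"\b%s\s*\(" % re.escape(name), source):
        depth, in_string, i = 1, False, start.end()
        while i < len(source) and depth > 0:
            ch = source[i]
            if ch == "\\":
                i += 1                      # skip the escaped character
            elif ch == '"':
                in_string = not in_string   # flip-flop on double quotes
            elif not in_string:
                if ch == "(":
                    depth += 1
                elif ch == ")":
                    depth -= 1
            i += 1
        if depth == 0:                      # unclosed calls are dropped
            line = source.count("\n", 0, start.start()) + 1
            results.append((line, source[start.start():i]))
    return results


data = '''x == Test("123" , "ABC") || x == Test ("123" , "DEF")
Test("123" , Test ("123" , "GHI"))?
Test("123" , "ABC(JKL)") || x == Test ("123" , "MNO")
'''
for line, call in find_calls(data):
    print("%6d - %s" % (line, call))
```

On the sample data this finds the same six calls, in the same order, as the Perl output listed above, including the nested call on line 2 and the call whose string argument contains parentheses on line 3.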