Stripping comments from Forth source code using regular expressions - regex

I am trying to match all content between parentheses, including parentheses in a non-greedy way. There should be a space before and after the opening parentheses (or the start of a line before the opening parentheses) and a space before and after the closing parentheses. Take the following text:
( )
( This is a comment )
1 2 +
\ a
: square dup * ;
( foo bar
baz )
(quux)
( ( )
(
( )
The first line should be matched, the second line including its content should be matched, the second last line should not be matched (or raise an error) and the last line should be matched. The two lines foo bar baz should be matched, but (quux) should not as it doesn't contain a space before and after the parentheses. The line with the extra opening parentheses inside should be matched.
I tried a few conventional regexes for matching content between parentheses but without much success. The regex engine is that of Go's.

re := regexp.MustCompile(`(?s)\(( | .*? )\)`)
s = re.ReplaceAllString(s, "")
Playground: https://play.golang.org/p/t93tc_hWAG

Regular expressions "can't count" (that's over-simplified, but bear with me), so you can't match on an unbounded amount of parenthesis nesting. I guess you're mostly concerned about matching only a single level in this case, so you would need to use something like:
foo := regexp.MustCompile(`^ *\( ([^ ]| [^)]*? \)$`)
This does require the comment to be the very last thing on a line, so it may be better to add "match zero or more spaces" there. This does NOT match the string "( ( ) )" or try to cater for arbitrary nesting, as that's well outside the counting that regular expressions can do.
What they can do in terms of counting is "count a specific number of times", they can't "count how many blah, then make sure there's the same number of floobs" (that requires going from a regular expression to a context-free grammar).
Playground

Here is a way to match all the 3 lines in question:
(?m)^[\t\p{Zs}]*\([\pZs}\t](?:[^()\n]*[\pZs}\t])?\)[\pZs}\t]*$
See the Go regex demo at the new regex101.com
Details:
(?m) - multiline mode on
^ - due to the above, the start of a line
[\t\p{Zs}]* - 0+ horizontal whitespaces
\( - a (
[\pZs}\t] - exactly 1 horizontal whitespace
(?:[^()\n]*[\pZs}\t])? - an optional sequence matching:
[^()\n]* - a negated character class matching 0+ characters other than (, ) and a newline
[\pZs}\t] - horizontal whitespace
\) - a literal )
[\pZs}\t]* - 0+ horizontal whitespaces
$ - due to (?m), the end of a line.
Go playground demo:
package main
import (
"regexp"
"fmt"
)
func main() {
var re = regexp.MustCompile(`(?m)^[\t\p{Zs}]*\([\pZs}\t](?:[^()\n]*[\pZs}\t])?\)[\pZs}\t]*$`)
var str = ` ( )
( This is a comment )
1 2 +
\ a
: square dup * ;
( foo bar
baz )
(quux)
( ( )
(
( )`
for i, match := range re.FindAllString(str, -1) {
fmt.Println("'", match, "' (found at index", i, ")")
}
}

Related

Regex: Match every occurrence of dash in between two specific symbols

I have this string: ](-test-word-another)
I'm trying to find every single occurrence of - in between ] and )
Basically the return should be: ](-test-word-another)
Currently I have (?<=\]\()(-)(?=\)) but that just finds if there is only one -
Thank you in advance
Try this: /(?<=\].*)-(?=.*\))/gm
Test here: https://regex101.com/r/xkmTZs/3
This basically matches all - only if they occur after a ] and before ).
You can use
text.replace(/]\([^()]*\)/g, (x) => x.replace(/-/g, '_'))
See the demo below:
const text = 'Word-word ](-test-word-another) text-text-.';
console.log( text.replace(/]\([^()]*\)/g, (x) => x.replace(/-/g, '_')) );
The ]\([^()]*\) regex matches a ]( substring, then any zero or more chars other than ( and ) and then a ), and then all - chars inside the match value (x here) are replaced with _ using (x) => x.replace(/-/g, '_').
Another, single regex solution can look like
(?<=]\((?:(?!]\()[^)])*)-(?=[^()]*\))
See this regex demo. It matches any - that is immediately preceded with ]( and any zero or more chars other than ) that does not start a ]( char sequence, and is immediately followed with any zero or more chars other than ( and ) and then a ) char.

Regex: Deal \r\n as normal word

I'm doing a small project which can calculate the count of functions in C++ files(.cpp).
I used the following Regex as "function pattern":
/[a-z|A-Z]+\s*::\s*~?[a-z|A-Z]+\(.*\)/gm
It works for most cases, but fails when there are new line breaks in ().
void CXYZRScanPanel::OnPrepareScanning()
{
//This one is ok.
}
void CXYZRScanPanel::OnPrepareScanning(int k)
{
//This one is ok.
}
void CXYZRScanPanel::OnPrepareScanning(int k,
int j)
{
//This one fails.
}
I'm thinking if there is anything "stronger" than the .* which can skip the \r\n.
Thanks for any help.
If there is no such a thing, I will probably remove all /r/n within () before doing the such.
You could write the pattern using a negated character class starting with [^ matching any char except ( and ) which will also match a newline.
Note that you can omit the | in the character class.
[a-zA-Z]+\s*::\s*~?[a-zA-Z]+(\([^()]*\))
The pattern matches:
[a-zA-Z]+ Match 1+ times chars a-zA-Z
\s*::\s* Match :: between optional whitespace chars
~? Match an optional ~ char
[a-zA-Z]+ Match 1+ times chars a-zA-Z
( Capture group 1
\([^()]*\) Optionally match any char except ( and ) between parenthesis
) Close group 1
See a regex demo

Extracting inner groups with regex

I have the following string
([Valor][Corr][Fat]: 6M UC x Viz. Lógicos IN('3','6')) AND (((SUM_RevisionAnomalia_UltRevision_1M = 1) AND (CANT_ConsumoFact_UltRevision_1M > 1)) OR ((SUM_RevisionNoAnomalia_UltRevision_1M + 1) AND (CANT_ConsumoFact_UltRevision_1M BETWEEN 1 - 2))) OR (SUM_RevisionNoAnomalia_UltRevision_1M <= 1)
and I am trying to extract all inner groups, so my answer should contain
([Valor][Corr][Fat]: 6M UC x Viz. Lógicos IN('3','6'))
(SUM_RevisionAnomalia_UltRevision_1M = 1)
(CANT_ConsumoFact_UltRevision_1M > 1)
(SUM_RevisionNoAnomalia_UltRevision_1M + 1)
(CANT_ConsumoFact_UltRevision_1M BETWEEN 1 - 2)
(SUM_RevisionNoAnomalia_UltRevision_1M <= 1)
It is quite easy to extract this when there is only 1 set of those strings inside parentheses, but when given the example above my regex captures the whole string.
The regex i am using is
/(\([a-zA-Z0-9\[\]:_+=-\s\.\(\),'óáéíúüçãôàäê><]+\))/g
It seems you just want to match what is in-between ( and ) that is not ( and ) unless these are (...) that are preceded with a word character.
You can use
\((?:[^()]|\b\([^()]*\))*\)
See the regex demo
The regex breakdown:
\( - matching a literal (
(?:[^()]|\b\([^()]*\))* - zero or more sequences of:
[^()] - any character other than ( and )
| - or...
\b\([^()]*\) - a word boundary (i.e. before that position, there must be a word character) followed with ( followed with zero or more characters other than ( and )
\) - a closing )
An alternative pattern can be an unrolled one (more efficient with longer inputs):
\([^()]*(?:\b\([^()]*\)[^()]*)*\)
See another demo

PCRE regex for multiple decimal coordinates using [lon,lat] format

I am trying to create a regex for [lon,lat] coordinates.
The code first checks if the input starts with '['.
If it does we check the validity of the coordinates via a regex
/([\[][-+]?(180(\.0{1,15})?|((1[0-7]\d)|([1-9]?\d))(\.\d{1,15})?),[-+]?([1-8]?\d(\.\d{1,15})?|90(\.0{1,15})?)[\]][\;]?)+/gm
The regex tests for [lon,lat] with 15 decimals [+- 180degrees, +-90degrees]
it should match :
single coordinates :
[120,80];
[120,80]
multiple coordinates
[180,90];[180,67];
[180,90];[180,67]
with newlines
[123,34];[-32,21];
[12,-67]
it should not match:
semicolon separator missing - single
[25,67][76,23];
semicolon separator missing - multiple
[25,67]
[76,23][12,90];
I currently have problems with the ; between coordinates (see 4 & 5)
jsfiddle equivalent here : http://regex101.com/r/vQ4fE0/4
You can try with this (human readable) pattern:
$pattern = <<<'EOD'
~
(?(DEFINE)
(?<lon> [+-]?
(?:
180 (?:\.0{1,15})?
|
(?: 1(?:[0-7][0-9]?)? | [2-9][0-9]? | 0 )
(?:\.[0-9]{1,15})?
)
)
(?<lat> [+-]?
(?:
90 (?:\.0{1,15})?
|
(?: [1-8][0-9]? | 9)
(?:\.[0-9]{1,15})?
)
)
)
\A
\[ \g<lon> , \g<lat> ] (?: ; \n? \[ \g<lon> , \g<lat> ] )* ;?
\z
~x
EOD;
explanations:
When you have to deal with a long pattern inside which you have to repeat several time the same subpatterns, you can use several features to make it more readable.
The most well know is to use the free-spacing mode (the x modifier) that allows to indent has you want the pattern (all spaces are ignored) and eventually to add comments.
The second consists to define subpatterns in a definition section (?(DEFINE)...) in which you can define named subpatterns to be used later in the main pattern.
Since I don't want to repeat the large subpatterns that describes the longitude number and the latitude number, I have created in the definition section two named pattern "lon" and "lat". To use them in the main pattern, I only need to write \g<lon> and \g<lat>.
javascript version:
var lon_sp = '(?:[+-]?(?:180(?:\\.0{1,15})?|(?:1(?:[0-7][0-9]?)?|[2-9][0-9]?|0)(?:\\.[0-9]{1,15})?))';
var lat_sp = '(?:[+-]?(?:90(?:\\.0{1,15})?|(?:[1-8][0-9]?|9)(?:\\.[0-9]{1,15})?))';
var coo_sp = '\\[' + lon_sp + ',' + lat_sp + '\\]';
var regex = new RegExp('^' + coo_sp + '(?:;\\n?' + coo_sp + ')*;?$');
var coordinates = new Array('[120,80];',
'[120,80]',
'[180,90];[180,67];',
'[123,34];[-32,21];\n[12,-67]',
'[25,67][76,23];',
'[25,67]\n[76,23]');
for (var i = 0; i<coordinates.length; i++) {
console.log("\ntest "+(i+1)+": " + regex.test(coordinates[i]));
}
fiddle
Try this out:
^(\[([+-]?(?!(180\.|18[1-9]|19\d{1}))\d{1,3}(\.\d{1,15})?,[+-]?(?!(90\.|9[1-9]))\d{1,2}(\.\d{1,15})?(\];$|\]$|\];\[)){1,})
Demo: http://regex101.com/r/vQ4fE0/7
Explanation
^(\[
Must start with a bracket
[+-]?
May or may not contain +- in front of the number
(?!(180\.|18[1-9]|19\d{1}))
Should not contain 180., 181-189 nor 19x
\d{1,3}(\.\d{1,15})?
Otherwise, any number containing 1 or 3 digits, with or without decimals (up to 15) are allowed
(?!(90\.|9[1-9]))
The 90 check is similar put here we are not allowing 90. nor 91-99
\d{1,2}(\.\d{1,15})?
Otherwise, any number containing 1 or 2 digits, with or without decimals (up to 15) are allowed
(\];$|\]$|\];\[)
The ending of a bracket body must have a ; separating two bracket bodies, otherwise it must be the end of the line.
{1,}
The brackets can exist 1 or multiple times
Hope this was helpful.
This might work. Note that you have a lot of capture groups, none of which
will give you good information because of recursive quantifiers.
# /^(\[[-+]?(180(\.0{1,15})?|((1[0-7]\d)|([1-9]?\d))(\.\d{1,15})?),[-+]?([1-8]?\d(\.\d{1,15})?|90(\.0{1,15})?)\](?:;\n?|$))+$/
^
( # (1 start)
\[
[-+]?
( # (2 start)
180
( \. 0{1,15} )? # (3)
|
( # (4 start)
( 1 [0-7] \d ) # (5)
|
( [1-9]? \d ) # (6)
) # (4 end)
( \. \d{1,15} )? # (7)
) # (2 end)
,
[-+]?
( # (8 start)
[1-8]? \d
( \. \d{1,15} )? # (9)
|
90
( \. 0{1,15} )? # (10)
) # (8 end)
\]
(?: ; \n? | $ )
)+ # (1 end)
$
Try a function approach, where the function can do some of the splitting for you, as well as delegating the number comparisons away from the regex. I tested it here: http://repl.it/YyG/3
//represents regex necessary to capture one coordinate, which
// looks like 123 or 123.13532
// the decimal part is a non-capture group ?:
var oneCoord = '(-?\\d+(?:\\.\\d+)?)';
//console.log("oneCoord is: "+oneCoord+"\n");
//one coordinate pair is represented by [x,x]
// check start/end with ^, $
var coordPair = '^\\['+oneCoord+','+oneCoord+'\\]$';
//console.log("coordPair is: "+coordPair+"\n");
//the full regex string consists of one or more coordinate pairs,
// but we'll do the splitting in the function
var myRegex = new RegExp(coordPair);
//console.log("my regex is: "+myRegex+"\n");
function isPlusMinus180(x)
{
return -180.0<=x && x<=180.0;
}
function isPlusMinus90(y)
{
return -90.0<=y && y<=90.0;
}
function isValid(s)
{
//if there's a trailing semicolon, remove it
if(s.slice(-1)==';')
{
s = s.slice(0,-1);
}
//remove all newlines and split by semicolon
var all = s.replace(/\n/g,'').split(';');
//console.log(all);
for(var k=0; k<all.length; ++k)
{
var match = myRegex.exec(all[k]);
if(match===null)
return false;
console.log(" match[1]: "+match[1]);
console.log(" match[2]: "+match[2]);
//break out if one pair is bad
if(! (isPlusMinus180(match[1]) && isPlusMinus90(match[2])) )
{
console.log(" one of matches out of bounds");
return false;
}
}
return true;
}
var coords = new Array('[120,80];',
'[120.33,80]',
'[180,90];[180,67];',
'[123,34];[-32,21];\n[12,-67]',
'[25,67][76,23];',
'[25,67]\n[76,23]',
'[190,33.33]',
'[180.33,33]',
'[179.87,90]',
'[179.87,91]');
var s;
for (var i = 0; i<coords.length; i++) {
s = coords[i];
console.log((i+1)+". ==== testing "+s+" ====");
console.log(" isValid? => " + isValid(s));
}

Having at least one-nonwhitespace

str must be true if it has at least one non-whitespace enclosed in the parenthesis:
str = (a)
str = ( as bs)
str = (as e)
and false if it has non-whitespace at all
str = ( )
Im not sure if i can do that + but this condition is also passing the 0 non-whitespaces. Correct it please.
/^\([\S+\s*]+\)$\.test(str)/
You can use this:
/^\(.*\S.*\)$/.test(str)
This matches any character, then a non-whitespace character (that makes it at least one non-whitespace character), and then any character till the end.
You can use the following:
^\((?!\s*\)).+\)$
This matches an open parentheses ( and then it fails if it is followed just by whitespaces and a ), or it accepts the entire line.
Assuming str must satisfy TRUE and FALSE and nesting is implicitly not allowed
^(?:[^()]*\([^\S()]*[^\s()][^\S()]*\))+[^()]*$
expanded
^
(?:
[^()]*
\(
[^\S()]*
[^\s()]
[^\S()]*
\)
)+
[^()]*
$