I'm trying to implement the removeComments function in Golang from this Javascript implementation. I'm hoping to remove any comments from the text. For example:
/* this is comments, and should be removed */
However, "/* this is quoted, so it should not be removed*/"
In the JavaScript implementation, quoted matches are not captured in groups, so I can easily filter them out. In Golang, however, it is not easy to tell whether the matched part was captured in a group or not. How can I implement the same removeComments logic in Golang as in the JavaScript version?
BACKGROUND
The correct way to do the task is to match and capture quoted strings (bearing in mind there can be escaped entities inside) and then match the multiline comments.
REGEX IN-CODE DEMO
Here is the code to deal with that:
package main

import (
	"fmt"
	"regexp"
)

func main() {
	reg := regexp.MustCompile(`("[^"\\]*(?:\\.[^"\\]*)*")|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/`)
	txt := `random text
/* removable comment */
"but /* never remove this */ one"
more random *text*`
	fmt.Println(reg.ReplaceAllString(txt, "$1"))
}
See the Playground demo
EXPLANATION
The regex I suggest is written with the Best Regex Trick Ever concept in mind and consists of 2 alternatives:
("[^"\\]*(?:\\.[^"\\]*)*") - Double quoted string literal regex - Group 1 (see the capturing group formed with the outer pair of unescaped parentheses and later accessible via replacement backreferences) matching double quoted string literals that can contain escaped sequences. This part matches:
" - a leading double quote
[^"\\]* - 0+ characters other than " and \ (as [^...] construct is a negated character class that matches any characters but those defined inside it) (the * is a zero or more occurrences matching quantifier)
(?:\\.[^"\\]*)* - 0+ sequences (see the last * and the non-capturing group used only to group subpatterns without forming a capture) of an escaped sequence (the \\. matches a literal \ followed with any character) followed with 0+ characters other than " and \
" - a trailing double quote
| - or
/\*[^*]*\*+(?:[^/*][^*]*\*+)*/ - the multiline comment part, which is matched without forming a capture group (and is thus unavailable from the replacement pattern via backreferences). It matches:
/ - the / literal slash
\* - the literal asterisk
[^*]* - zero or more characters other than an asterisk
\*+ - 1 or more (the + is a one or more occurrences matching quantifier) asterisks
(?:[^/*][^*]*\*+)* - 0+ sequences (non-capturing, we do not use it later) of any character but a / or * (see [^/*]), followed with 0+ characters other than an asterisk (see [^*]*) and then followed with 1+ asterisks (see \*+).
/ - a literal (trailing, closing) slash.
NOTE: This multiline comment regex is the fastest I have ever tested. Same goes for the double quoted literal regex as "[^"\\]*(?:\\.[^"\\]*)*" is written with the unroll-the-loop technique in mind: no alternations, only character classes with * and + quantifiers are used in a specific order to allow the fastest matching.
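The $1 replacement works because a group that did not participate expands to an empty string. If you would rather test explicitly whether Group 1 took part in a match, one possible sketch uses FindAllStringSubmatchIndex, which reports -1 for the offsets of a group that did not match. The function name stripComments is made up for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Same pattern as above: Group 1 is a quoted string, the other branch is a comment.
var commentOrString = regexp.MustCompile(`("[^"\\]*(?:\\.[^"\\]*)*")|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/`)

// stripComments (hypothetical name) rebuilds the text match by match.
func stripComments(src string) string {
	var b strings.Builder
	last := 0
	for _, loc := range commentOrString.FindAllStringSubmatchIndex(src, -1) {
		b.WriteString(src[last:loc[0]]) // text between matches is kept
		if loc[2] >= 0 {
			// Group 1 participated: a quoted string, keep it.
			b.WriteString(src[loc[2]:loc[3]])
		}
		// Otherwise the match was a comment and is dropped.
		last = loc[1]
	}
	b.WriteString(src[last:])
	return b.String()
}

func main() {
	fmt.Println(stripComments(`a /* x */ "b /* y */" c`))
}
```

The explicit check is handy when the kept and dropped branches need different handling, for example logging every removed comment.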
NOTES ON PATTERN ENHANCEMENTS
If you plan to extend to matching single quoted literals, there is nothing easier, just add another alternative into the 1st capture group by re-using the double quoted string literal regex and replacing the double quotes with single ones:
reg := regexp.MustCompile(`("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/`)
^-------------------------^
Here is the single- and double-quoted literal supporting regex demo removing the multiline comments
Adding single line comment support is similar: just add a //.*[\r\n]* alternative to the end:
reg := regexp.MustCompile(`("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//.*[\r\n]*`)
^-----------^
Here is the single- and double-quoted literal supporting regex demo removing the multiline and single line comments
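As a small sketch of how the extended pattern behaves, the program below wraps it in a function (the name removeComments is taken from the question, the sample input is invented). Note that the single line comment branch also consumes the line break that follows the comment.

```go
package main

import (
	"fmt"
	"regexp"
)

// Quoted literals are captured in Group 1; both comment styles are matched uncaptured.
var reg = regexp.MustCompile(`("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//.*[\r\n]*`)

// removeComments keeps quoted literals and drops /* ... */ and // comments.
func removeComments(txt string) string {
	return reg.ReplaceAllString(txt, "$1")
}

func main() {
	in := "x := 'a // b' // gone\ny /* z */ done"
	fmt.Printf("%q\n", removeComments(in))
}
```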
These do not preserve formatting
Preferred way (produces a NULL if group 1 is not matched)
works in golang playground -
# https://play.golang.org/p/yKtPk5QCQV
# fmt.Println(reg.ReplaceAllString(txt, "$1"))
# (?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//[^\n]*(?:\n|$))|("[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|[\S\s][^/"'\\]*)
(?: # Comments
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
|
// [^\n]* # Start // comment
(?: \n | $ ) # End // comment
)
|
( # (1 start), Non - comments
"
[^"\\]* # Double quoted text
(?: \\ [\S\s] [^"\\]* )*
"
|
'
[^'\\]* # Single quoted text
(?: \\ [\S\s] [^'\\]* )*
'
| [\S\s] # Any other char
[^/"'\\]* # Chars which don't start a comment, string, escape, or line continuation (escape + newline)
) # (1 end)
Alternative way (group 1 is always matched, but could be empty)
works in golang playground -
# https://play.golang.org/p/7FDGZSmMtP
# fmt.Println(reg.ReplaceAllString(txt, "$1"))
# (?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//[^\n]*(?:\n|$))?((?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|[\S\s][^/"'\\]*)?)
(?: # Comments
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
|
// [^\n]* # Start // comment
(?: \n | $ ) # End // comment
)?
( # (1 start), Non - comments
(?:
"
[^"\\]* # Double quoted text
(?: \\ [\S\s] [^"\\]* )*
"
|
'
[^'\\]* # Single quoted text
(?: \\ [\S\s] [^'\\]* )*
'
| [\S\s] # Any other char
[^/"'\\]* # Chars which don't start a comment, string, escape, or line continuation (escape + newline)
)?
) # (1 end)
The Cadillac - Preserves Formatting
(Unfortunately, this can't be done in Golang because Golang's regexp package does not support lookahead assertions)
Posted in case you move to a different regex engine.
# raw: ((?:(?:^[ \t]*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|/\*|//)))?|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|/\*|//))|(?=\r?\n))))+)|("[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?:\r?\n|[\S\s])[^/"'\\\s]*)
# delimited: /((?:(?:^[ \t]*)?(?:\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/)))?|\/\/(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/))|(?=\r?\n))))+)|("[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?:\r?\n|[\S\s])[^\/"'\\\s]*)/
( # (1 start), Comments
(?:
(?: ^ [ \t]* )? # <- To preserve formatting
(?:
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
(?: # <- To preserve formatting
[ \t]* \r? \n
(?=
[ \t]*
(?: \r? \n | /\* | // )
)
)?
|
// # Start // comment
(?: # Possible line-continuation
[^\\]
| \\
(?: \r? \n )?
)*?
(?: # End // comment
\r? \n
(?= # <- To preserve formatting
[ \t]*
(?: \r? \n | /\* | // )
)
| (?= \r? \n )
)
)
)+ # Grab multiple comment blocks if need be
) # (1 end)
| ## OR
( # (2 start), Non - comments
"
[^"\\]* # Double quoted text
(?: \\ [\S\s] [^"\\]* )*
"
|
'
[^'\\]* # Single quoted text
(?: \\ [\S\s] [^'\\]* )*
'
|
(?: \r? \n | [\S\s] ) # Linebreak or Any other char
[^/"'\\\s]* # Chars which don't start a comment, string, escape,
# or line continuation (escape + newline)
) # (2 end)
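As noted above, this pattern relies on lookahead assertions such as (?= ... ), which Go's regexp package rejects at compile time. A quick way to check whether a given pattern is usable in Go is simply to try compiling it. The helper name compiles is only illustrative.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether Go's regexp package accepts the pattern.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	// A fragment with a lookahead, as used by the format-preserving regex.
	fmt.Println(compiles(`\r?\n(?=[ \t]*//)`))
	// The lookahead-free form used by the Go-compatible versions.
	fmt.Println(compiles(`//[^\n]*(?:\n|$)`))
}
```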
I've never read/written anything in Go, so bear with me. Fortunately, I know regex. I did a little research on Go regexes, and it would seem that they lack most modern features (such as backreferences).
Despite that, I've developed a regex that seems to be what you're looking for. I'm assuming that all strings are single line. Here it is:
reg := regexp.MustCompile(`(?m)^([^"\n]*)/\*([^*]+|(\*+[^/]))*\*+/`)
txt := `random text
/* removable comment */
"but /* never remove this */ one"
more random *text*`
fmt.Println(reg.ReplaceAllString(txt, "${1}"))
Variation: The version above will not remove comments that happen after quotation marks. This version will, but it may need to be run multiple times.
reg := regexp.MustCompile(
`(?m)^(([^"\n]*|("[^"\n]*"))*)/\*([^*]+|(\*+[^/]))*\*+/`
)
txt := `
random text
what /* removable comment */
hi "but /* never remove this */ one" then /*whats here*/ i don't know /*what*/
more random *text*
`
newtxt := reg.ReplaceAllString(txt, "${1}")
fmt.Println(newtxt)
newtxt = reg.ReplaceAllString(newtxt, "${1}")
fmt.Println(newtxt)
Explanation
(?m) means multiline mode. Regex101 gives a nice explanation of this:
The ^ and $ anchors now match at the beginning/end of each line respectively, instead of beginning/end of the entire string.
It needs to be anchored to the beginning of each line (with ^) to ensure a quote hasn't started.
The first regex has this: [^"\n]*. Essentially, it's matching everything that's not " or \n. I've added parentheses because this stuff isn't part of a comment, so it needs to be put back.
The second regex has this: (([^"\n]*|("[^"\n]*"))*). With this, the regex can either match [^"\n]* (like the first regex does), or (|) it can match a pair of quotes (and the content between them) with "[^"\n]*". It repeats, so it works when there is more than one quote pair, for example. Note that, like the simpler regex, this non-comment stuff is being captured.
Both regexes use this: /\*([^*]+|(\*+[^/]))*\*+/. It matches /* followed by any amount of either:
[^*]+ Non * chars
or
\*+[^/] One or more *s that are not followed by /.
And then it matches the closing */
During replacement, the ${1} refers to the non-comment things that were captured, so they're reinserted into the string.
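Since the second regex may need to be run multiple times, one way to avoid guessing the number of passes is to repeat the replacement until the text stops changing. This is only a sketch; the function name removeAllComments and the sample input are made up.

```go
package main

import (
	"fmt"
	"regexp"
)

// The second regex from above: Group 1 holds the non-comment prefix of the line.
var lineComment = regexp.MustCompile(`(?m)^(([^"\n]*|("[^"\n]*"))*)/\*([^*]+|(\*+[^/]))*\*+/`)

// removeAllComments reapplies the replacement until a pass changes nothing.
// Every successful pass removes at least one comment, so the loop terminates.
func removeAllComments(txt string) string {
	for {
		next := lineComment.ReplaceAllString(txt, "${1}")
		if next == txt {
			return txt
		}
		txt = next
	}
}

func main() {
	fmt.Println(removeAllComments(`a /*1*/ b "c /*2*/" d /*3*/ e`))
}
```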
Just for fun, another approach: a minimal lexer implemented as a state machine, inspired by and well described in Rob Pike's talk http://cuddle.googlecode.com/hg/talk/lex.html. The code is more verbose but more readable, understandable and hackable than a regexp. It also works with any Reader and Writer, not only strings, so it doesn't need to hold the whole input in RAM and should even be faster.
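package main

import (
	"bufio"
	"io"
	"os"
	"strings"
)

// stateFn is one lexer state; it returns the next state, or nil to stop.
type stateFn func(*lexer) stateFn

type lexer struct {
	io.RuneReader
	io.Writer
}

func run(l *lexer) {
	for state := lexText; state != nil; {
		state = state(l)
	}
}

// lexText copies plain text and watches for a quote or a comment opener.
func lexText(l *lexer) stateFn {
	slash := false // true while a '/' is held back, waiting for the next rune
	for r, _, err := l.ReadRune(); err == nil; r, _, err = l.ReadRune() {
		if slash {
			slash = false
			if r == '*' {
				return lexComment
			}
			io.WriteString(l, "/") // the held '/' did not open a comment
		}
		switch r {
		case '"':
			io.WriteString(l, string(r))
			return lexQuoted
		case '/':
			slash = true
		default:
			io.WriteString(l, string(r))
		}
	}
	if slash {
		io.WriteString(l, "/") // input ended right after a lone '/'
	}
	return nil
}

// lexQuoted copies a string literal verbatim, honoring backslash escapes.
func lexQuoted(l *lexer) stateFn {
	escaped := false // true when the previous rune was an unescaped '\'
	for r, _, err := l.ReadRune(); err == nil; r, _, err = l.ReadRune() {
		io.WriteString(l, string(r))
		switch {
		case escaped:
			escaped = false
		case r == '\\':
			escaped = true
		case r == '"':
			return lexText
		}
	}
	return nil
}

// lexComment discards everything up to and including the closing */.
func lexComment(l *lexer) stateFn {
	star := false // true when the previous rune was '*', so "**/" also closes
	for r, _, err := l.ReadRune(); err == nil; r, _, err = l.ReadRune() {
		if star && r == '/' {
			return lexText
		}
		star = r == '*'
	}
	return nil
}

func main() {
	in := strings.NewReader(`text /* comment **/ "quoted /* kept */" more`)
	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()
	run(&lexer{in, out})
}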
You can see it works http://play.golang.org/p/HyvEeANs1u
Demo
Play golang demo
(The workings at each stage are output and the end result can be seen by scrolling down.)
Method
A few "tricks" are used to work around Golang's somewhat limited regex syntax:
1. Replace start quotes and end quotes with a unique character. Crucially, the characters used to identify start and end quotes must be different from each other and extremely unlikely to appear in the text being processed.
2. Replace all comment starters (/*) that are not preceded by an unterminated start quote with a unique sequence of one or more characters.
3. Similarly, replace all comment enders (*/) that are not succeeded by an end quote that does not have a start quote before it with a different unique sequence of one or more characters.
4. Remove all remaining /*...*/ comment sequences.
5. Unmask the previously "masked" comment starters/enders by reversing the replacements made in steps 2 and 3 above.
Limitations
The current demo doesn't address the possibility of a double quote appearing within a comment, e.g. /* Not expected: " */. Note: My feeling is this could be handled - just haven't put the effort in yet - so let me know if you think it could be an issue and I'll look into it.
Try this example..
play golang
Related
I have this pattern (?<!')(\w*)\((\d+|\w+|.*,*)\) that is meant to match strings like:
c(4)
hello(54, 41)
Following some answers on SO, I added a negative lookbehind so that if the input string is preceded by a ', the string shouldn't match at all. However, it still partially matches.
For example:
'c(4) returns (4) even though it shouldn't match anything because of the negative lookbehind.
How do I make it so if a string is preceded by ' NOTHING matches?
Since nobody came along, I'll throw this out to get you started.
This regex will match things like
aa(a , sd,,,f,)
aa( as , " ()asdf)) " ,, df, , )
asdf()
but not
'ab(s)
This will fix the basic problem (?<!['\w])\w*
Where (?<!['\w]) will not let the engine skip over a word char just
to satisfy the not quote.
Then the optional words \w* to grab all the words.
And if a 'aaa( quote is before it, then it won't match.
This regex here embellishes what I think you are trying to accomplish
in the function body part of your regex.
It might be a little overwhelming to understand at first.
(?s)(?<!['\w])(\w*)\(((?:,*(?&variable)(?:,+(?&variable))*[,\s]*)?)\)(?(DEFINE)(?<variable>(?:\s*(?:"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')\s*|[^()"',]+)))
Readable version (via: http://www.regexformat.com)
(?s) # Dot-all modifier
(?<! ['\w] ) # Not a quote, nor word behind
# <- This will force matching a complete function name
# if it exists, thereby blocking a preceding quote '
( \w* ) # (1), Function name (optional)
\(
( # (2 start), Function body
(?: # Parameters (optional)
,* # Comma (optional)
(?&variable) # Function call, get first variable (required)
(?: # More variables (optional)
,+ # Comma (required)
(?&variable) # Variable (required)
)*
[,\s]* # Whitespace or comma (optional)
)? # End parameters (optional)
) # (2 end)
\)
# Function definitions
(?(DEFINE)
(?<variable> # (3 start), Function for a single Variable
(?:
\s*
(?: # Double or single quoted string
"
[^"\\]*
(?: \\ . [^"\\]* )*
"
|
'
[^'\\]*
(?: \\ . [^'\\]* )*
'
)
\s*
| # or,
[^()"',]+ # Not quote, paren, comma (can be whitespace)
)
) # (3 end)
)
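The (?<!['\w]) fix above needs an engine with lookbehind. In an engine without it, such as Go's regexp package, a rough approximation is to consume the preceding character instead of asserting on it. The sketch below assumes a simplified argument body, [^()]*, rather than the full (?(DEFINE) ...) grammar, and the function name findCalls is invented. One caveat: because the preceding character is consumed, two calls that touch each other, like a(1)b(2), yield only the first.

```go
package main

import (
	"fmt"
	"regexp"
)

// (?:^|[^'\w]) stands in for the lookbehind (?<!['\w]): the call must start
// the input or follow a character that is neither a quote nor a word char.
var call = regexp.MustCompile(`(?:^|[^'\w])(\w*)\(([^()]*)\)`)

// findCalls returns a {name, arguments} pair for each call found.
func findCalls(s string) [][2]string {
	var out [][2]string
	for _, m := range call.FindAllStringSubmatch(s, -1) {
		out = append(out, [2]string{m[1], m[2]})
	}
	return out
}

func main() {
	fmt.Println(findCalls("x = hello(54, 41)"))
	fmt.Println(findCalls("'c(4)"))
}
```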
I'm creating a PowerShell script that parses a file containing C code and detects if it contains calls to free(), malloc() or realloc() functions.
file_one.c
int MethodOne()
{
return 1;
}
int MethodTwo()
{
free();
return 1;
}
file_two.c
int MethodOne()
{
//free();
return 1;
}
int MethodTwo()
{
free();
return 1;
}
check.ps1
$regex = "(^[^/]*free\()|(^[^/]*malloc\()|(^[^/]*realloc\()"
$file_one= "Z:\PATH\file_one.txt"
$file_two= "Z:\PATH\file_two.txt"
$contentOne = Get-Content $file_one -Raw
$contentOne -match $regex
$contentTwo = Get-Content $file_two -Raw
$contentTwo -match $regex
Processing the whole file at once seems to work well with contentOne:
I get True (because of the free() in MethodTwo).
Processing contentTwo is not so lucky and returns False instead of True
(because of the free() in MethodTwo).
Can someone help me to write a better regex that works in both cases?
Sure, this is it
Raw:
^(?>(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n))|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\b(?:free|malloc|realloc)\()[\S\s](?:(?!\b(?:free|malloc|realloc)\()[^/"'\\])*))*(?:(\bfree\()|(\bmalloc\()|(\brealloc\())
Stringed:
"^(?>(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\(?:\\r?\\n)?)*?(?:\\r?\\n))|(?:\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|(?!\\b(?:free|malloc|realloc)\\()[\\S\\s](?:(?!\\b(?:free|malloc|realloc)\\()[^/\"'\\\\])*))*(?:(\\bfree\\()|(\\bmalloc\\()|(\\brealloc\\())"
Verbatim:
#"^(?>(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n))|(?:""[^""\\]*(?:\\[\S\s][^""\\]*)*""|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\b(?:free|malloc|realloc)\()[\S\s](?:(?!\b(?:free|malloc|realloc)\()[^/""'\\])*))*(?:(\bfree\()|(\bmalloc\()|(\brealloc\())"
Explained
^
(?>
(?: # Comments
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
|
// # Start // comment
(?: # Possible line-continuation
[^\\]
| \\
(?: \r? \n )?
)*?
(?: \r? \n ) # End // comment
)
| # OR,
(?: # Non - comments
"
[^"\\]* # Double quoted text
(?: \\ [\S\s] [^"\\]* )*
"
| '
[^'\\]* # Single quoted text
(?: \\ [\S\s] [^'\\]* )*
'
| # OR,
(?! # ASSERT: Here, cannot be free / malloc / realloc {}
\b
(?: free | malloc | realloc )
\(
)
[\S\s] # Any char which could start a comment, string, etc..
# (Technically, we're going past a C++ source code error)
(?: # -------------------------
(?! # ASSERT: Here, cannot be free / malloc / realloc {}
\b
(?: free | malloc | realloc )
\(
)
[^/"'\\] # Char which doesn't start a comment, string, escape,
# or line continuation (escape + newline)
)* # -------------------------
) # Done Non - comments
)*
(?:
( \b free\( ) # (1), Free()
|
( \b malloc\( ) # (2), Malloc()
|
( \b realloc\( ) # (3), Realloc()
)
Some notes:
This only finds the first one from the beginning of string using ^ anchor.
To find them all, just remove the ^ from the regex.
This works because it matches everything up to what you're looking for.
In this case, what it found is in capture group 1, 2, or 3.
Good Luck !!
What the regex contains:
----------------------------------
* Format Metrics
----------------------------------
Atomic Groups = 1
Cluster Groups = 10
Capture Groups = 3
Assertions = 2
( ? ! = 2
Free Comments = 25
Character Classes = 12
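The same idea (match comments and strings first so they get skipped, and capture only the real calls) can be sketched without atomic groups or lookaheads. The Go version below is an illustration under simplifying assumptions: it ignores line continuations inside // comments, and the names callOrSkip and findAllocCalls are invented.

```go
package main

import (
	"fmt"
	"regexp"
)

// Comments and quoted literals are matched without capturing, so anything
// inside them is consumed; only a real call reaches Group 1.
var callOrSkip = regexp.MustCompile(`/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//[^\n]*|"[^"\\]*(?:\\[\s\S][^"\\]*)*"|'[^'\\]*(?:\\[\s\S][^'\\]*)*'|\b(free|malloc|realloc)\(`)

// findAllocCalls lists the function names called outside comments and strings.
func findAllocCalls(src string) []string {
	var calls []string
	for _, m := range callOrSkip.FindAllStringSubmatch(src, -1) {
		if m[1] != "" {
			calls = append(calls, m[1])
		}
	}
	return calls
}

func main() {
	fileTwo := "int MethodOne()\n{\n//free();\nreturn 1;\n}\nint MethodTwo()\n{\nfree();\nreturn 1;\n}\n"
	fmt.Println(findAllocCalls(fileTwo))
}
```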
edit
Per request, explanation of the part of the regex that handles
/**/ comments. This -> /\*[^*]*\*+(?:[^/*][^*]*\*+)*/
This is a modified unrolled-loop regex that assumes an opening delimiter
of /* and a closing one of */.
Notice that the opening and closing delimiters share a common character, /,
in their sequences.
To be able to do this without lookaround assertions, a method is used
to shift the trailing delimiter's asterisk inside the loop.
Using this factoring, all that's needed is to check for a closing /
to complete the delimited sequence.
/\* # Opening delimiter /*
[^*]* # Optionally, consume all non-asterisks
\*+ # This must match 1 or more asterisks (the anchor), or FAIL.
# This is matched here to align the optional loop below
# because it is looking for the closing /.
(?: # The optional loop part
[^/*] # Specifically a single non / character (nor asterisk).
# Since a / will be the next closing delimiter, it must be excluded.
[^*]* # Optional non-asterisks.
# This will accept a / because it is supposed to consume ALL
# opening delimiters as it goes
# and will consider the very next */ as a close.
\*+ # This must match 1 or more asterisks (the anchor), or FAIL.
)* # Repeat 0 to many times.
/ # Closing delimiter /
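To see the factoring at work, the comment pattern can be tried on a few edge cases. This small Go sketch (the helper name firstComment is made up) returns the first comment found, or an empty string when the comment is never closed.

```go
package main

import (
	"fmt"
	"regexp"
)

var blockComment = regexp.MustCompile(`/\*[^*]*\*+(?:[^/*][^*]*\*+)*/`)

// firstComment returns the first /* ... */ comment in s, or "" if none closes.
func firstComment(s string) string {
	return blockComment.FindString(s)
}

func main() {
	for _, s := range []string{
		"/**/",               // empty comment
		"/***/",              // only asterisks inside
		"x /* a * b */ c */", // stops at the very next */
		"/* never closed",    // no match
	} {
		fmt.Printf("%q -> %q\n", s, firstComment(s))
	}
}
```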
In the header of every file in my project I have a block of comments that were generated by VCC with revision history for the file. We have moved away from VCC and no longer want to have the revision history in the file since it is obsolete.
I currently have a pcregrep search that returns the exact results I'm looking for:
pcregrep -rM '(\$Rev)(?s)(.+?(?=\*\*))' *
I have tried piping the results into an xargs sed along with many other attempts to remove all the returned lines from the files but I get various errors including "File name too long"
I want to delete the entire block
Since you are talking about C++ files, you can't just find comments,
you have to parse comments because literal strings could contain comment
delimiters.
This has been done before, no use re-inventing the wheel.
A simple grep is not going to cut it. You need a simple macro or C# console app
that has better capabilities.
If you want to go this route, below is a regex for you.
Each match will either match group 1 (comments block) or group 2 (non-comments).
You need to either build a new string by appending the result of each match,
or use a callback function to do the replacement.
Each time it matches group 2, just append that (or return it if callback) unchanged.
When it matches group 1, you have to run another regular expression on the
contents to see if the comment block contains the Revision information.
If it does contain it, don't append (or return "" if callback) its contents.
If it doesn't, just append it unchanged.
So, it's a 2 step process.
Pseudo-code:
// Read in the source string from the file here.
string strSrc = ReadFile( name );
string strNew = "";
Matcher CmtMatch, RevMatch;
while ( GloballyFind( CommentRegex, strSrc, CmtMatch ) )
{
    if ( CmtMatch.matched(1) )
    {
        string strComment = CmtMatch.value(1);
        if ( FindFirst( RevisionRegex, strComment, RevMatch ) )
            continue;
        else
            strNew += strComment;
    }
    else
        strNew += CmtMatch.value(2);
}
// Write out the new string here.
The same could be done via a ReplaceAll() using a callback function, if
using a Macro language. The logic goes in the callback.
It's not as hard as it looks, but if you want to do it right I'd do it this way.
And then, hey, you got a nifty utility to be used again.
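The two step process can be sketched in Go roughly as follows. This is an illustration, not a drop-in tool: it uses a simplified comment/string pattern without the format-preserving lookaheads, it treats any comment containing $Rev as a revision block, and the names token, revision and dropRevisionComments are invented.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Group 1 is a comment; quoted literals are matched only so that comment
// delimiters inside them are skipped.
var token = regexp.MustCompile(`(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//[^\n]*)|"[^"\\]*(?:\\[\s\S][^"\\]*)*"|'[^'\\]*(?:\\[\s\S][^'\\]*)*'`)

// Step 2 pattern: does the comment carry revision information?
var revision = regexp.MustCompile(`\$Rev`)

// dropRevisionComments copies src, leaving out comments that match revision.
func dropRevisionComments(src string) string {
	var b strings.Builder
	last := 0
	for _, loc := range token.FindAllStringSubmatchIndex(src, -1) {
		b.WriteString(src[last:loc[0]])
		last = loc[1]
		match := src[loc[0]:loc[1]]
		isComment := loc[2] >= 0
		if isComment && revision.MatchString(match) {
			continue // revision history block: do not append
		}
		b.WriteString(match) // other comments and strings are kept unchanged
	}
	b.WriteString(src[last:])
	return b.String()
}

func main() {
	src := "/* $Rev: 12 $ old history */\nint x; /* keep */ char *s = \"/* $Rev */\";"
	fmt.Println(dropRevisionComments(src))
}
```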
Here is the regex expanded, formatted and compressed.
(constructed with RegexFormat 6 (Unicode))
# raw: ((?:(?:^\h*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:\h*\n(?=\h*(?:\n|/\*|//)))?|//(?:[^\\]|\\\n?)*?(?:\n(?=\h*(?:\n|/\*|//))|(?=\n))))+)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\\s]*)
# delimited: /((?:(?:^\h*)?(?:\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/(?:\h*\n(?=\h*(?:\n|\/\*|\/\/)))?|\/\/(?:[^\\]|\\\n?)*?(?:\n(?=\h*(?:\n|\/\*|\/\/))|(?=\n))))+)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^\/"'\\\s]*)/
# Dbl-quoted: "((?:(?:^\\h*)?(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/(?:\\h*\\n(?=\\h*(?:\\n|/\\*|//)))?|//(?:[^\\\\]|\\\\\\n?)*?(?:\\n(?=\\h*(?:\\n|/\\*|//))|(?=\\n))))+)|(\"(?:\\\\[\\S\\s]|[^\"\\\\])*\"|'(?:\\\\[\\S\\s]|[^'\\\\])*'|[\\S\\s][^/\"'\\\\\\s]*)"
# Sing-quoted: '((?:(?:^\h*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:\h*\n(?=\h*(?:\n|/\*|//)))?|//(?:[^\\\]|\\\\\n?)*?(?:\n(?=\h*(?:\n|/\*|//))|(?=\n))))+)|("(?:\\\[\S\s]|[^"\\\])*"|\'(?:\\\[\S\s]|[^\'\\\])*\'|[\S\s][^/"\'\\\\\s]*)'
( # (1 start), Comments
(?:
(?: ^ \h* )? # <- To preserve formatting
(?:
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
(?: # <- To preserve formatting
\h* \n
(?=
\h*
(?: \n | /\* | // )
)
)?
|
// # Start // comment
(?: [^\\] | \\ \n? )*? # Possible line-continuation
(?: # End // comment
\n
(?= # <- To preserve formatting
\h*
(?: \n | /\* | // )
)
| (?= \n )
)
)
)+ # Grab multiple comment blocks if need be
) # (1 end)
| ## OR
( # (2 start), Non - comments
"
(?: \\ [\S\s] | [^"\\] )* # Double quoted text
"
| '
(?: \\ [\S\s] | [^'\\] )* # Single quoted text
'
| [\S\s] # Any other char
[^/"'\\\s]* # Chars which don't start a comment, string, escape,
# or line continuation (escape + newline)
) # (2 end)
In case you want something simpler -
This is the same regex without multiple comment block capture or format preserving. Same grouping and replacement principle applies.
# Raw: (/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)
( # (1 start), Comments
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
|
// # Start // comment
(?: [^\\] | \\ \n? )*? # Possible line-continuation
\n # End // comment
) # (1 end)
|
( # (2 start), Non - comments
"
(?: \\ [\S\s] | [^"\\] )* # Double quoted text
"
| '
(?: \\ [\S\s] | [^'\\] )* # Single quoted text
'
| [\S\s] # Any other char
[^/"'\\]* # Chars which don't start a comment, string, escape,
# or line continuation (escape + newline)
) # (2 end)
I'm working on a gettext javascript parser and I'm stuck on the parsing regex.
I need to catch every argument passed to a specific method call _n( and _(. For example, if I have these in my javascript files:
_("foo") // want "foo"
_n("bar", "baz", 42); // want "bar", "baz", 42
_n(domain, "bux", var); // want domain, "bux", var
_( "one (optional)" ); // want "one (optional)"
apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) // could have on the same line two calls..
This refs this documentation: http://poedit.net/trac/wiki/Doc/Keywords
I'm planning to do it in two passes (with two regexes):
catch all function arguments for _n( or _( method calls
catch the stringy ones only
Basically, I'd like a regex that could say "catch everything after _n( or _( and stop at the last parenthesis ) when the function call is done". I don't know if this is possible with regex and without a JavaScript parser.
What could also be done is "catch every "string" or 'string' after _n( or _( and stop at the end of the line OR at the beginning of a new _n( or _( call".
In everything I've done I get either stuck on _( "one (optional)" ); with its inside parenthesis or apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) with two calls on the same line.
Here is what I implemented so far, with imperfect regexes: a generic parser and the javascript one or the handlebars one
Note: Read this answer if you're not familiar with recursion.
Part 1: match specific functions
Who said that regex can't be modular? Well PCRE regex to the rescue!
~ # Delimiter
(?(DEFINE) # Start of definitions
(?P<str_double_quotes>
(?<!\\) # Not escaped
" # Match a double quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
" # Match the ending double quote
)
(?P<str_single_quotes>
(?<!\\) # Not escaped
' # Match a single quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
' # Match the ending single quote
)
(?P<brackets>
\( # Match an opening bracket
(?: # A non capturing group
(?&str_double_quotes) # Recurse/use the str_double_quotes pattern
| # Or
(?&str_single_quotes) # Recurse/use the str_single_quotes pattern
| # Or
[^()] # Anything not a bracket
| # Or
(?&brackets) # Recurse the bracket pattern
)*
\)
)
) # End of definitions
# Let's start matching for real now:
_n? # Match _ or _n
\s* # Optional white spaces
(?P<results>(?&brackets)) # Recurse/use the brackets pattern and put it in the results group
~sx
The s is for matching newlines with . and the x modifier is for this fancy spacing and commenting of our regex.
Online regex demo
Online php demo
Part 2: getting rid of opening & closing brackets
Since our regex will also get the opening and closing brackets (), we might need to filter them. We will use preg_replace() on the results:
~ # Delimiter
^ # Assert begin of string
\( # Match an opening bracket
\s* # Match optional whitespaces
| # Or
\s* # Match optional whitespaces
\) # Match a closing bracket
$ # Assert end of string
~x
Online php demo
Part 3: extracting the arguments
So here's another modular regex, you could even add your own grammar:
~ # Delimiter
(?(DEFINE) # Start of definitions
(?P<str_double_quotes>
(?<!\\) # Not escaped
" # Match a double quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
" # Match the ending double quote
)
(?P<str_single_quotes>
(?<!\\) # Not escaped
' # Match a single quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
' # Match the ending single quote
)
(?P<array>
Array\s*
(?&brackets)
)
(?P<variable>
[^\s,()]+ # I don't know the exact grammar for a variable in ECMAScript
)
(?P<brackets>
\( # Match an opening bracket
(?: # A non capturing group
(?&str_double_quotes) # Recurse/use the str_double_quotes pattern
| # Or
(?&str_single_quotes) # Recurse/use the str_single_quotes pattern
| # Or
(?&array) # Recurse/use the array pattern
| # Or
(?&variable) # Recurse/use the variable pattern
| # Or
[^()] # Anything not a bracket
| # Or
(?&brackets) # Recurse the bracket pattern
)*
\)
)
) # End of definitions
# Let's start matching for real now:
(?&array)
|
(?&variable)
|
(?&str_double_quotes)
|
(?&str_single_quotes)
~xis
We will loop and use preg_match_all(). The final code would look like this:
$functionPattern = <<<'regex'
~ # Delimiter
(?(DEFINE) # Start of definitions
(?P<str_double_quotes>
(?<!\\) # Not escaped
" # Match a double quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
" # Match the ending double quote
)
(?P<str_single_quotes>
(?<!\\) # Not escaped
' # Match a single quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
' # Match the ending single quote
)
(?P<brackets>
\( # Match an opening bracket
(?: # A non capturing group
(?&str_double_quotes) # Recurse/use the str_double_quotes pattern
| # Or
(?&str_single_quotes) # Recurse/use the str_single_quotes pattern
| # Or
[^()] # Anything not a bracket
| # Or
(?&brackets) # Recurse the bracket pattern
)*
\)
)
) # End of definitions
# Let's start matching for real now:
_n? # Match _ or _n
\s* # Optional white spaces
(?P<results>(?&brackets)) # Recurse/use the brackets pattern and put it in the results group
~sx
regex;
$argumentsPattern = <<<'regex'
~ # Delimiter
(?(DEFINE) # Start of definitions
(?P<str_double_quotes>
(?<!\\) # Not escaped
" # Match a double quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
" # Match the ending double quote
)
(?P<str_single_quotes>
(?<!\\) # Not escaped
' # Match a single quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
' # Match the ending single quote
)
(?P<array>
Array\s*
(?&brackets)
)
(?P<variable>
[^\s,()]+ # I don't know the exact grammar for a variable in ECMAScript
)
(?P<brackets>
\( # Match an opening bracket
(?: # A non capturing group
(?&str_double_quotes) # Recurse/use the str_double_quotes pattern
| # Or
(?&str_single_quotes) # Recurse/use the str_single_quotes pattern
| # Or
(?&array) # Recurse/use the array pattern
| # Or
(?&variable) # Recurse/use the variable pattern
| # Or
[^()] # Anything not a bracket
| # Or
(?&brackets) # Recurse the bracket pattern
)*
\)
)
) # End of definitions
# Let's start matching for real now:
(?&array)
|
(?&str_double_quotes)
|
(?&str_single_quotes)
|
(?&variable)
~six
regex;
$input = <<<'input'
_ ("foo") // want "foo"
_n("bar", "baz", 42); // want "bar", "baz", 42
_n(domain, "bux", var); // want domain, "bux", var
_( "one (optional)" ); // want "one (optional)"
apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) // could have on the same line two calls..
// misleading cases
_n("foo (")
_n("foo (\)", 'foo)', aa)
_n( Array(1, 2, 3), Array(")", '(') );
_n(function(foo){return foo*2;}); // Is this even valid?
_n (); // Empty
_ (
"Foo",
'Bar',
Array(
"wow",
"much",
'whitespaces'
),
multiline
); // PCRE is awesome
input;
if(preg_match_all($functionPattern, $input, $m)){
$filtered = preg_replace(
'~ # Delimiter
^ # Assert begin of string
\( # Match an opening bracket
\s* # Match optional whitespaces
| # Or
\s* # Match optional whitespaces
\) # Match a closing bracket
$ # Assert end of string
~x', // Regex
'', // Replace with nothing
$m['results'] // Subject
); // Getting rid of opening & closing brackets
// Part 3: extract arguments:
$parsedTree = array();
foreach($filtered as $arguments){ // Loop
if(preg_match_all($argumentsPattern, $arguments, $m)){ // If there's a match
$parsedTree[] = array(
'all_arguments' => $arguments,
'branches' => $m[0]
); // Add an array to our tree and fill it
}else{
$parsedTree[] = array(
'all_arguments' => $arguments,
'branches' => array()
); // Add an array with empty branches
}
}
print_r($parsedTree); // Let's see the results;
}else{
echo 'no matches';
}
Online php demo
You might want to create a recursive function to generate a full tree. See this answer.
You might notice that the function(){} part isn't parsed correctly. I will leave that as an exercise for the readers :)
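For the "recursive function" idea, the core step is splitting an argument string at top-level commas only. Here is a minimal Python sketch of that step. It is a hand-written scanner rather than the PCRE recursion used above (Python's built-in re module has no (?&name) subroutine calls), and the name split_top_level is made up for illustration.

```python
def split_top_level(arguments):
    """Split on commas that are outside quotes and outside nested parentheses."""
    parts, current = [], []
    depth = 0      # current parenthesis nesting level
    quote = None   # the quote character we are inside, if any
    i = 0
    while i < len(arguments):
        ch = arguments[i]
        if quote:
            current.append(ch)
            if ch == "\\" and i + 1 < len(arguments):
                # Keep the escaped character and skip over it
                i += 1
                current.append(arguments[i])
            elif ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
            current.append(ch)
        elif ch == "(":
            depth += 1
            current.append(ch)
        elif ch == ")":
            depth -= 1
            current.append(ch)
        elif ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
        i += 1
    tail = "".join(current).strip()
    if tail or parts:
        parts.append(tail)
    return parts
```

Calling it again on the inside of each Array(...) branch would give the nested tree.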
Try this:
(?<=\().*?(?=\s*\)[^)]*$)
See live demo
The regex below should help you.
^(?=\w+\()\w+?\(([\s'!\\\)",\w]+)+\);
Check the demo here
\(( |"(\\"|[^"])*"|'(\\'|[^'])*'|[^)"'])*?\)
This should get anything between a pair of parentheses, ignoring parentheses in quotes.
Explanation:
\( // Literal open paren
(
| //Space or
"(\\"|[^"])*"| //Anything between two double quotes, including escaped quotes, or
'(\\'|[^'])*'| //Anything between two single quotes, including escaped quotes, or
[^)"'] //Any character that isn't a quote or close paren
)*? // All that, as many times as necessary
\) // Literal close paren
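To try that pattern outside a regex tester, here is a small Python sketch. The pattern is copied unchanged from above; the function name parenthesized_groups and the trimming of the outer parentheses are my own additions for illustration.

```python
import re

# The pattern from the explanation above, unchanged. A raw triple-quoted
# string lets both quote characters appear without extra escaping.
PAREN_GROUP = re.compile(r'''\(( |"(\\"|[^"])*"|'(\\'|[^'])*'|[^)"'])*?\)''')

def parenthesized_groups(source):
    """Return the text inside each matched pair of parentheses, trimmed."""
    # group(0) is the whole match including both parens, so slice them off
    return [m.group(0)[1:-1].strip() for m in PAREN_GROUP.finditer(source)]
```

Note that the match ends at the first closing paren that is not inside quotes, so nested calls such as Array(1, 2, 3) inside the argument list are cut short.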
No matter how you slice it, regular expressions are going to cause problems. They're hard to read, hard to maintain, and highly inefficient. I'm unfamiliar with gettext, but perhaps you could use a for loop?
// This is just pseudocode. A loop like this can be more readable, maintainable, and predictable than a regular expression.
for(int i = 0; i < input.length; i++) {
    // Ignoring anything that isn't an opening paren
    if(input[i] == '(') {
        String capturedText = "";
        // Loop until a close paren is reached, or an EOF is reached
        // (the bounds check comes first so input[i] is never read past the end)
        for(; i < input.length && input[i] != ')'; i++) {
            if(input[i] == '"' || input[i] == '\'') {
                char quote = input[i];
                // Keep the opening quote and step past it; otherwise the loop below would stop immediately
                capturedText += input[i++];
                // Loop until an unescaped close quote is reached, or an EOF is reached
                for(; i < input.length && (input[i] != quote || input[i - 1] == '\\'); i++) {
                    capturedText += input[i];
                }
            }
            // Append the current character (the closing quote, if a string was just scanned)
            if(i < input.length) {
                capturedText += input[i];
            }
        }
        capture(capturedText);
    }
}
Note: I didn't cover how to determine if it's a function or just a grouping symbol (i.e., this will match a = (b * c)). That's complicated, as is covered in detail here. As your code gets more and more accurate, you get closer and closer to writing your own JavaScript parser. You might want to take a look at the source code for actual JavaScript parsers if you need that sort of accuracy.
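For anyone who wants to run the loop idea, here is a rough Python sketch of it. It is not a translation of the pseudocode line by line: it skips the opening paren, and it handles escapes by skipping the character after a backslash instead of looking at the previous character. The function name capture_call_arguments is made up, and like the pseudocode it does not try to tell function calls from grouping parens or handle nested parens.

```python
def capture_call_arguments(source):
    """Collect the text between each '(' and the next ')' outside quotes."""
    captured = []
    i, n = 0, len(source)
    while i < n:
        if source[i] != "(":
            i += 1
            continue
        i += 1                      # step past the opening paren
        start = i
        while i < n and source[i] != ")":
            ch = source[i]
            if ch in "\"'":
                i += 1              # step past the opening quote
                # Scan to the matching close quote, or the end of input
                while i < n and source[i] != ch:
                    if source[i] == "\\":
                        i += 1      # skip the escaped character
                    i += 1
            i += 1
        captured.append(source[start:i])
        i += 1                      # step past the closing paren
    return captured
```

For example, capture_call_arguments('a = (b * c)') returns ['b * c'], which is the false positive mentioned in the note.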
One bit of code (you can test this PHP code at http://writecodeonline.com/php/ to check):
$string = '_("foo")
_n("bar", "baz", 42);
_n(domain, "bux", var);
_( "one (optional)" );
apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples)';
preg_match_all('/(?<=(_\()|(_n\())[\w", ()%]+(?=\))/i', $string, $matches);
foreach($matches[0] as $test){
$opArr = explode(',', $test);
foreach($opArr as $test2){
echo trim($test2) . "\n";
}
}
You can see the initial pattern and how it works here: http://regex101.com/r/fR7eU2/1
Output is:
"foo"
"bar"
"baz"
42
domain
"bux"
var
"one (optional)"
"No apples"
"%1 apple"
"%1 apples"
apples
We can do this in two steps:
1) Catch all function arguments for _n( or _( method calls:
(?:_\(|_n\()(?:[^()]*\([^()]*\))*[^()]*\)
See demo.
http://regex101.com/r/oE6jJ1/13
2) Catch the stringy ones only:
"([^"]*)"|(?:\(|,)\s*([^"),]*)(?=,|\))
See demo.
http://regex101.com/r/oE6jJ1/14
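The two steps can be chained like this in Python. This is only a sketch: both patterns are copied as given above, while the names CALL_RE, ARG_RE and extract_arguments are made up, and I chose to drop empty arguments (e.g. from an empty call).

```python
import re

# Step 1: whole _( ... ) or _n( ... ) calls, allowing one level of nested parens
CALL_RE = re.compile(r'(?:_\(|_n\()(?:[^()]*\([^()]*\))*[^()]*\)')
# Step 2: group 1 is the inside of a double-quoted string,
# group 2 is an unquoted argument that follows "(" or ","
ARG_RE = re.compile(r'"([^"]*)"|(?:\(|,)\s*([^"),]*)(?=,|\))')

def extract_arguments(source):
    """Return one list of arguments per matched call."""
    results = []
    for call in CALL_RE.finditer(source):
        args = []
        for m in ARG_RE.finditer(call.group(0)):
            if m.group(1) is not None:
                args.append(m.group(1))           # quoted string, quotes removed
            elif m.group(2).strip():
                args.append(m.group(2).strip())   # bare word or number
        results.append(args)
    return results
```

Note that the quoted arguments come back without their surrounding quotes, because group 1 only captures the inside of the string.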
Optimize?
I am having trouble with the following regular expression:
/^\s*(/?\*{1,2}(\s*(\b.*\b)\s*(\*/)?)?|\*/?\s*)$/g
Can this expression be improved? Also, if anyone spots a problem with it, please point it out. Here is my live demo in action. It works for all the conditions that I have set up below.
Test constraints
These Match
/**
/*
*
*/
/** Javadoc */
/* Block */
* Multi-line
/* Single Line */
/** A
/** A */
/* A
/* A */
These Shouldn't
7 * 8
// Regular comment
Results
After replacing the match with: // $3
I successfully converted them, despite some of them having trailing white-space:
//
//
//
// Javadoc
// Block
// Multi-line
// Single Line
// A
// A
// A
// A
Regex explained
/
^ Line start
\s* 0 or more white-space
( Start group 1
/? forward-slash (OPTIONAL)
\*{1,2} 1 to 2 asterisks
( Start group 2
\s* 0 or more white-space
( Start group 3
\b Start word boundary
.* 0 or more of anything
\b End word boundary
) End group 3
\s* 0 or more white-space
( Start group 4 (OPTIONAL)
\* Asterisk
/ Forward-slash
)? End group 4
)? End group 2 (OPTIONAL)
| OR
\* Asterisk
/? Forward-slash (OPTIONAL)
\s* 0 or more white-space
) End group 1
$ Line end
/
g Global; match all
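For anyone who wants to check the expression outside the live demo, here is a minimal Python sketch that applies it one line at a time. The function name to_line_comment is made up, and a replacement function stands in for the // $3 replacement string so that an unmatched group 3 becomes an empty string.

```python
import re

# The expression from the question, without the /g flag because it is
# applied to one line at a time here.
BLOCK_LINE = re.compile(r'^\s*(/?\*{1,2}(\s*(\b.*\b)\s*(\*/)?)?|\*/?\s*)$')

def to_line_comment(line):
    """Rewrite a block-comment line as a // comment; other lines pass through."""
    # Group 3 is the comment text; it is None for lines such as "/**" or "*/"
    return BLOCK_LINE.sub(lambda m: "// " + (m.group(3) or ""), line)
```

A line like /** comes back as "// " with a trailing space, which is the trailing white-space mentioned in the results.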
Parsing C/C++ style comments is a little more complex when you have to take into account string embedding, escaping, and line continuation.
This also requires a single-character consumption pass-through method (after all else is checked). That being the case, of the many possible match alternations, there is only one item you're interested in: the C-style comment /* */.
So, in the case of a global search and a lot of matches, only one interests you, say a capture group 1 match. Of course, in your case a simple replace won't cut it.
So, you have to sit in a global find (but not replace) loop. Each match will be a part of the source string, no part is skipped, so each one will be appended to a destination string.
During the loop, when capture group 1 matches (C-style comment), you can do your not-so-simple substitutions to turn it into a C++ comment, then append that result to the destination string.
That's the gist of it. If this can't be done in whatever language you use, then it can't be done right, that is for sure!
Here is a list of regexes to perform the conversion to // when you've captured a C-style comment. They have to be performed in the order of appearance (the notation is Perl, for example only):
# s{ ^ /\* (?: [^\S\n] | \*)* }{// }xmg;
# s{ ^ (?: [^\S\n] | \*)* \*/ $ }{// }xmg;
# s{ ^ $ }{// }xmg;
# s{ ^ (?: [^\S\n] | \*)+ }{// }xmg;
# s{ (?<![\s*]) (?: [^\S\n] | \*)+ $ }{}xmg;
# s{ (?<![\s*]) (?: [^\S\n] | \*)* \*/ $ }{}xmg;
This is the regex to use in your find loop (as explained above). This regex can be found on any Perl newsgroup; it's part of the FAQ.
# (?:(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/)|//(?:[^\\]|\\\n?)*?\n)|(?:"(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)
# -----------------------------------------------------------------
(?: # Comments
( # (1 start)
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
) # (1 end)
|
// # Start // comment
(?: [^\\] | \\ \n? )*? # Possible line-continuation
\n # End // comment
)
|
(?: # Non - comments
"
(?: \\ [\S\s] | [^"\\] )* # Double quoted text
"
| '
(?: \\ [\S\s] | [^'\\] )* # Single quoted text
'
| [\S\s] # Any other char
[^/"'\\]* # Chars which don't start a comment, string, escape,
# or line continuation (escape + newline)
)
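The find loop described above might look like this in Python. Treat it as a sketch under assumptions: the tokenizing regex is the one quoted above, but c_to_cpp_comment is a simplified stand-in for the six Perl substitutions, all the names are made up, and it assumes each /* */ comment is the last thing on its line (otherwise the code after it would end up commented out).

```python
import re

# The regex from above, split over two raw strings. Group 1 is set only
# when the match is a /* ... */ comment.
TOKEN = re.compile(
    r'''(?:(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/)|//(?:[^\\]|\\\n?)*?\n)'''
    r'''|(?:"(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)'''
)

def c_to_cpp_comment(comment):
    """Simplified conversion: one // line per line of the block comment."""
    body = comment[2:-2]  # drop the leading /* and the trailing */
    lines = [ln.strip().lstrip("*").strip() for ln in body.split("\n")]
    return "\n".join("// " + ln if ln else "//" for ln in lines)

def convert_comments(source):
    """Global find loop: every match is appended, only group 1 is rewritten."""
    out = []
    for m in TOKEN.finditer(source):
        if m.group(1) is not None:
            out.append(c_to_cpp_comment(m.group(1)))
        else:
            out.append(m.group(0))  # strings, // comments and other text pass through
    return "".join(out)
```

Because quoted strings are consumed by their own alternative, a /* */ sequence inside a string literal is copied through untouched.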