Regex to select newlines in between quotation marks - regex

I have a string in Ruby that resembles the following:
{
"a boolean": true,
"multiline": "
my
multiline
value
",
"a normal key": "a normal value"
}
I would like to match only the newline characters in the substring:
"
my
multiline
value
",
This is so that I can replace them with escaped newline characters. The aim here is to make the JSON easier to work with in the long run.

Update - These regexes work as expected.
From #faissaloo: "it seemed to fail however on my large JSON."
I ran this large string through both regexes:
PCRE https://regex101.com/r/3jtqea/1
Ruby https://regex101.com/r/1HVCCC/1
They both work identically, and without flaw.
If you have any other concerns, please let me know.
I thought Ruby supported the Perl-like constructs, in which case this could be done in a single global find and replace.
Edit - Ruby doesn't support backtracking control verbs such as (*SKIP)(*FAIL),
so doing this in Ruby requires a more explicit regex.
With a slight modification of the PCRE/Perl regex, the Ruby equivalent is:
Ruby
Find
(?-m)((?!\A)\G|(?:(?>[^"]*"[^"\r\n]*"[^"]*))*")([^"\r\n]*)\K\r?\n(?=[^"]*")((?:[^"\r\n]*"(?:(?>[^"]*"[^"\r\n]*"))*[^"]*)?)
Replace
\\n\3
https://regex101.com/r/BaqjEE/1
https://rextester.com/NVFD38349
Explained ( but it's complex )
(?-m) # Non-multiline mode safety check
( # (1 start), Prefix. Capture for debug
(?! \A ) # Not BOS
\G # Test where last match left off
| # or,
(?: # Optionally align to next " ( only used once )
(?> [^"]* " [^"\r\n]* " [^"]* )
)*
" # A new quote to test
) # (1 end)
( [^"\r\n]* ) # (2), Line break Preamble. Capture for debug
\K # Exclude from the match (group 0) up to this point
\r? \n # Line break to escape
(?= [^"]* " ) # Validate we have " closure
( # (3 start), Optional end quote and alignment.
# To be written back.
(?:
[^"\r\n]* "
(?: # Optionally align to next "
(?> [^"]* " [^"\r\n]* " )
)*
[^"]*
)?
) # (3 end)
# Ruby Code:
#----------------------
# #ruby 2.3.1
#
# re = /(?-m)((?!\A)\G|(?:(?>[^"]*"[^"\r\n]*"[^"]*))*")([^"\r\n]*)\K\r?\n(?=[^"]*")((?:[^"\r\n]*"(?:(?>[^"]*"[^"\r\n]*"))*[^"]*)?)/
# str = '{
# "a boolean": true,
# "a boolean": true,
# "a boolean": true,
# "a boolean": true,
# "multiline": "
# my
# multiline
# value
# asdf"
# ,
#
# "a multiline boo
# lean": true,
# "a normal key": "a multiline
#
# value"
# }'
# subst = '\\n\3'
#
# result = str.gsub(re, subst)
#
# # Print the result of the substitution
# puts result
For Pcre/Perl
Find
(?:((?:(?>[^"]*"[^"\n]*"[^"]*))+(*SKIP)(*FAIL)|"|(?!^)\G)([^"\n]*)\K\n(?=[^"]*")((?:[^"\n]*")?))
Replace
\\n$3
https://regex101.com/r/06naae/1
Explained ( but it's complex )
Note if you're on a windows box where editors need CRLF breaks,
add a \r in front of the LF, like this \r\n.
(?:
( # (1 start), Prefix capture, for debug
(?:
(?> [^"]* " [^"\n]* " [^"]* )
)+
(*SKIP) (*FAIL) # Consume false positives, but ignore them
# (need this to align next ")
| # or,
" # A new quote to test
| # or,
(?! ^ ) # Not BOS
\G # Test where last match left off
) # (1 end)
( [^"\n]* ) # (2), Preamble capture, for debug
\K # Exclude from the match (group 0) up to this point
\n # Line break to escape
(?= [^"]* " ) # Validate we have " closure
( # (3 start), End quote, to be written back
(?: [^"\n]* " )?
) # (3 end)
)

I think this could help you. You capture the \n inside the string and then can replace it:
"[^"]*(\n)*",
Test it

Another option would be something like this:
string = '{
"a boolean": true,
"multiline": "my
multiline
value",
"a normal value"
}'
puts string.match(/"(\w+)(\n+\w*)+"/).to_s.gsub("\n", '\n')
This matches the regex in your string and then substitutes the newlines with escaped newlines.
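For comparison, the same match-then-substitute idea can be sketched in Python (used here only as a neutral illustration; the question itself is about Ruby). The helper name escape_newlines_in_strings is made up for this sketch, and it assumes that every double quote in the input either delimits a string or is backslash-escaped.

```python
import json
import re

# A double-quoted string: any run of non-quote, non-backslash characters
# or backslash escapes. DOTALL lets an escape span a line break as well.
QUOTED = re.compile(r'"(?:[^"\\]|\\.)*"', re.DOTALL)


def escape_newlines_in_strings(text):
    """Replace raw line breaks inside double-quoted strings with a literal \\n."""
    def fix(match):
        # Only the text of one quoted string is touched here.
        return match.group(0).replace("\r\n", "\\n").replace("\n", "\\n")
    return QUOTED.sub(fix, text)


broken = (
    '{\n'
    '"a boolean": true,\n'
    '"multiline": "\n'
    'my\n'
    'multiline\n'
    'value\n'
    '",\n'
    '"a normal key": "a normal value"\n'
    '}'
)
fixed = escape_newlines_in_strings(broken)
parsed = json.loads(fixed)
```

Line breaks between keys are left alone, so the overall layout of the document survives and only the string values change.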

Late answer, but you can use a regex like:
'"(?=\n).*?"'
Matches:
"
my
multiline
value
",
Demo:
Regex Demo & Explanation

If your multiline strings don't include commas (right before the line breaks) then you could use the fact that in JSON every line has to end with ,, {, or [, or the next line has to start with } or ]:
json_string.gsub(/(?<!,|\{|\[)\n(?!\s*[}\]])/, '\n')
If you have commas in your strings (or curly and square brackets) you can improve this approach by adding more details to the list of valid line endings:
valid_line_ends = %w(true, false, ", }, ], { [)
line_end_matcher = valid_line_ends.map(&Regexp.method(:escape)).join('|')
json_string.gsub(/(?<!#{line_end_matcher})\n(?!\s*[}\]])/, '\n')
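A rough Python rendering of the first, simpler pattern, as a sketch only. The name escape_by_line_endings is hypothetical, and the same caveat applies as in the Ruby version: a string value whose line ends in a comma, { or [ will be missed.

```python
import json
import re

# A line break is treated as "inside a value" unless the previous character
# is one of , { [ or the next non-blank character is } or ].
LINE_BREAK_INSIDE_VALUE = re.compile(r"(?<![,{\[])\n(?!\s*[}\]])")


def escape_by_line_endings(text):
    # A callable replacement avoids any escape processing of the result.
    return LINE_BREAK_INSIDE_VALUE.sub(lambda m: "\\n", text)


broken = (
    '{\n'
    '"a boolean": true,\n'
    '"multiline": "\n'
    'my\n'
    'multiline\n'
    'value\n'
    '",\n'
    '"a normal key": "a normal value"\n'
    '}'
)
repaired = escape_by_line_endings(broken)
data = json.loads(repaired)
```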

Related

Regex Ruby How to group every word within parentheses

I'm trying to get all the words between the parentheses after a specific word and the end of the string.
For example, I have this case:
p " some other text in downcase LOREM (foo, bar)".scan(/ LOREM \((.*?)\)\z/m)
# [["foo, bar"]]
The regex is getting foo, bar which is between the parenthesis, it's okay, but I'd like to get them like two separate elements within a single array, meaning:
["foo", "bar"]
That's to say, the regex should group every words as a separate element.
My intention is to get everything between LOREM ( and the last closing parenthesis ).
I've tried adding (\b\w+\b), which groups every word in the string. But when adding it to the attempt to get the words from the parenthesis, it returns nothing.
You may use
.scan(/(?:\G(?!\A)\s*,\s*|\sLOREM\s+\()\K\w+(?=[^()]*\)\z)/)
See the Ruby demo and the Rubular regex demo. You may replace \w+ with [[:alnum:]]+, or \p{L}+ (to only match letters), or [^\s,()]+ (to match any 1+ chars other than whitespace, ,, ( and )), it all depends on what you want to match inside the parentheses.
Details
(?:\G(?!\A)\s*,\s*|\sLOREM\s+\() - either the end of the previous successful match and a , enclosed with 0+ whitespaces, or whitespace, LOREM, 1+ whitespaces and (
\K - omit the text matched so far
\w+ - consume 1+ word chars
(?=[^()]*\)\z) - immediately to the right, there must be 0 or more chars other than ( and ) and then ) at the end of the string.
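Engines without \G and \K (Python's built-in re module is one) can get the same result in two passes. The following is a sketch of that swapped-in technique, not a translation of the regex above; the function name words_after_lorem is made up.

```python
import re


def words_after_lorem(text):
    """Return the words inside the final 'LOREM ( ... )' group, or []."""
    # Pass 1: capture what sits between 'LOREM (' and a ')' that ends the string.
    # \Z is Python's end-of-string anchor, playing the role of Ruby's \z here.
    group = re.search(r"\sLOREM\s+\(([^()]*)\)\Z", text)
    if group is None:
        return []
    # Pass 2: pull the individual words out of the captured text.
    return re.findall(r"\w+", group.group(1))


words = words_after_lorem(" some other text in downcase LOREM (foo, bar)")
```

As with the single-regex version, \w+ can be swapped for a different character class depending on what counts as a word.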
r = /
(?<= # begin a positive lookbehind
LOREM[ ] # match 'LOREM '
\( # match left paren
| # or
,[ ] # match a comma followed by a space
) # end positive lookbehind
(?: # begin a non-capture group
[^, ")]+ # match one or more characters other than ',', ' ', '"' and ")"
| # or
\" # match a double quote
[^, ")]+ # match one or more characters other than ',', ' ', '"' and ")"
\" # match a double quote
) # end non-capture group
(?= # begin a positive lookahead
.*\) # match any number of characters followed by a right paren
) # end positive lookahead
/x # free-spacing regex definition mode
Conventionally this is written
r = /(?<=LOREM \(|, )(?:[^, ")]+|\"[^, ")]+\")(?=.*\))/
Let's try it.
str = "some other text in downcase LOREM (foo, \"bar\", \"baz), daz"
str.scan(r)
#=> ["foo", "\"bar\""]
The first match, "foo", matches
str.scan /(?<=LOREM \()[^, ")]+/
#=> ["foo"]
That is, this matches one or more characters other than a comma, space, double quote or right parenthesis, immediately preceded by "LOREM " followed by a left parenthesis.
The next attempted match begins at the end of "foo". There is no match of "L" in "LOREM" so an attempt is made to match ", ", which is met with success. [^, ")]+ does not match "bar", so an attempt is made to match \"[^, ")]+\", which is successful. As ", " is matched within the lookaround it is not part of the match returned. This matches '"bar"'.
\"baz is not matched because it has no closing double quote.

Using regular expression extractor to extract a value?

I am trying to extract the value from the following code. Even though my regex expression is fine it is still not extracting the value.
token" value="(.+?)"
this does give me the exact match which I checked using regex101.com
<input type="hidden" name="token" value="GSYGEP2UUWOTMZ2SFV1G5D2M8L247KIG">
What should the regular expression be?
Your original regular expression is just fine:
value="(.+?)"
The problem might be additional spaces or the surrounding code. Try removing the token" prefix, or escaping the ", if necessary.
DEMO 1
DEMO 2
Reference:
Regular Expressions
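As a quick sanity check outside the extractor tool, the pattern from the question can be run against the sample line in Python. This is only a sketch; the second pattern, which tolerates extra whitespace between the attributes, is an assumption about what might differ in the real response.

```python
import re

html = '<input type="hidden" name="token" value="GSYGEP2UUWOTMZ2SFV1G5D2M8L247KIG">'

# The expression from the question, unchanged.
match = re.search(r'token" value="(.+?)"', html)
token = match.group(1) if match else None

# A slightly more tolerant variant: any amount of whitespace between attributes.
tolerant = re.search(r'name="token"\s+value="([^"]+)"', html)
tolerant_token = tolerant.group(1) if tolerant else None
```

If the first pattern matches here but not in the tool, the difference is likely in the actual response body (spacing, attribute order, quoting) or in how the tool is configured rather than in the regex.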
Try this
<input(?=(?:[^>"']|"[^"]*"|'[^']*')*?\sname\s*=\s*(['"])\s*token\s*\1)(?=(?:[^>"']|"[^"]*"|'[^']*')*?\svalue\s*=\s*(['"])((?:(?!\2)[\S\s])*)\2)\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>
The Value content you're after is in Capture Group 3
https://regex101.com/r/HJhStT/1
https://regex101.com/r/8BWONb/1
Explained
< input # Input tag
(?= # Name attribute: Assert (a pseudo atomic group)
(?: [^>"'] | " [^"]* " | ' [^']* ' )*?
\s name \s* = \s* # name =
( ['"] ) # (1), Quote
\s* token \s* # token
\1
)
(?= # Value attribute
(?: [^>"'] | " [^"]* " | ' [^']* ' )*?
\s value \s* = \s* # value =
( ['"] ) # (2), Quote
( # (3 start), value content
(?:
(?! \2 )
[\S\s]
)*
) # (3 end)
\2
)
# Just get rest of tag
\s+
(?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
>
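The point of the two lookaheads is that name and value may appear in any order. A simpler way to get that property, sketched here in Python, is to grab each input tag and then read its attributes into a dictionary. The function name token_value is made up, and unlike the regex above this sketch assumes attribute values never contain a > character.

```python
import re

INPUT_TAG = re.compile(r"<input\b[^>]*>", re.IGNORECASE)
# name = "value" or name = 'value'; group 2 remembers which quote was used.
ATTRIBUTE = re.compile(r"""([\w-]+)\s*=\s*(["'])(.*?)\2""", re.DOTALL)


def token_value(html):
    """Return the value attribute of the first <input name="token">, or None."""
    for tag in INPUT_TAG.finditer(html):
        attrs = {
            name.lower(): value
            for name, _quote, value in ATTRIBUTE.findall(tag.group(0))
        }
        if attrs.get("name") == "token":
            return attrs.get("value")
    return None


sample = '<input type="hidden" name="token" value="GSYGEP2UUWOTMZ2SFV1G5D2M8L247KIG">'
reordered = "<input value='abc' name=\"token\">"
```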

Match a pattern not preceded by a quotation mark

I have this pattern (?<!')(\w*)\((\d+|\w+|.*,*)\) that is meant to match strings like:
c(4)
hello(54, 41)
Following some answers on SO, I added a negative lookbehind so that if the input string is preceded by a ', the string shouldn't match at all. However, it still partially matches.
For example:
'c(4) returns (4) even though it shouldn't match anything because of the negative lookbehind.
How do I make it so if a string is preceded by ' NOTHING matches?
Since nobody came along, I'll throw this out to get you started.
This regex will match things like
aa(a , sd,,,f,)
aa( as , " ()asdf)) " ,, df, , )
asdf()
but not
'ab(s)
This will fix the basic problem (?<!['\w])\w*
Where (?<!['\w]) will not let the engine skip over a word char just
to satisfy the not quote.
Then the optional words \w* to grab all the words.
And if a 'aaa( quote is before it, then it won't match.
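The difference is easy to see by running both lookbehinds side by side. The sketch below uses Python; the FIXED pattern simplifies the argument part to [^()]* purely for illustration, which is an assumption and not part of the original pattern.

```python
import re

# Pattern from the question: only a quote directly before the match is rejected,
# so the engine simply starts one character later, at the "(".
ORIGINAL = re.compile(r"(?<!')(\w*)\((\d+|\w+|.*,*)\)")

# Rejecting a preceding word character as well forces the match to begin
# at the start of the function name, where the quote is then visible.
FIXED = re.compile(r"(?<!['\w])(\w*)\(([^()]*)\)")

partial = ORIGINAL.search("'c(4)")
blocked = FIXED.search("'c(4)")
plain = FIXED.search("c(4)")
two_args = FIXED.search("hello(54, 41)")
```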
This regex here embellishes what I think you are trying to accomplish
in the function body part of your regex.
It might be a little overwhelming to understand at first.
(?s)(?<!['\w])(\w*)\(((?:,*(?&variable)(?:,+(?&variable))*[,\s]*)?)\)(?(DEFINE)(?<variable>(?:\s*(?:"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')\s*|[^()"',]+)))
Readable version (via: http://www.regexformat.com)
(?s) # Dot-all modifier
(?<! ['\w] ) # Not a quote, nor word behind
# <- This will force matching a complete function name
# if it exists, thereby blocking a preceding quote '
( \w* ) # (1), Function name (optional)
\(
( # (2 start), Function body
(?: # Parameters (optional)
,* # Comma (optional)
(?&variable) # Function call, get first variable (required)
(?: # More variables (optional)
,+ # Comma (required)
(?&variable) # Variable (required)
)*
[,\s]* # Whitespace or comma (optional)
)? # End parameters (optional)
) # (2 end)
\)
# Function definitions
(?(DEFINE)
(?<variable> # (3 start), Function for a single Variable
(?:
\s*
(?: # Double or single quoted string
"
[^"\\]*
(?: \\ . [^"\\]* )*
"
|
'
[^'\\]*
(?: \\ . [^'\\]* )*
'
)
\s*
| # or,
[^()"',]+ # Not quote, paren, comma (can be whitespace)
)
) # (3 end)
)

Golang regex replace excluding quoted strings

I'm trying to implement the removeComments function in Golang from this Javascript implementation. I'm hoping to remove any comments from the text. For example:
/* this is comments, and should be removed */
However, "/* this is quoted, so it should not be removed*/"
In the Javascript implementation, quoted matches are not captured in groups, so I can easily filter them out. However, in Golang, it seems it's not easy to tell whether the matched part was captured in a group or not. So how can I implement the same removeComments logic in Golang as in the Javascript version?
BACKGROUND
The correct way to do the task is to match and capture quoted strings (bearing in mind there can be escaped entities inside) and then matching the multiline comments.
REGEX IN-CODE DEMO
Here is the code to deal with that:
package main
import (
"fmt"
"regexp"
)
func main() {
reg := regexp.MustCompile(`("[^"\\]*(?:\\.[^"\\]*)*")|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/`)
txt := `random text
/* removable comment */
"but /* never remove this */ one"
more random *text*`
fmt.Println(reg.ReplaceAllString(txt, "$1"))
}
See the Playground demo
EXPLANATION
The regex I suggest is written with the Best Regex Trick Ever concept in mind and consists of 2 alternatives:
("[^"\\]*(?:\\.[^"\\]*)*") - Double quoted string literal regex - Group 1 (see the capturing group formed with the outer pair of unescaped parentheses and later accessible via replacement backreferences) matching double quoted string literals that can contain escaped sequences. This part matches:
" - a leading double quote
[^"\\]* - 0+ characters other than " and \ (as [^...] construct is a negated character class that matches any characters but those defined inside it) (the * is a zero or more occurrences matching quantifier)
(?:\\.[^"\\]*)*" - 0+ sequences (see the last * and the non-capturing group used only to group subpatterns without forming a capture) of an escaped sequence (the \\. matches a literal \ followed with any character) followed with 0+ characters other than " and \
| - or
/\*[^*]*\*+(?:[^/*][^*]*\*+)*/ - multiline comment regex part that matches without forming a capture group (thus, unavailable from the replacement pattern via backreferences) and matches
/ - the / literal slash
\* - the literal asterisk
[^*]* - zero or more characters other than an asterisk
\*+ - 1 or more (the + is a one or more occurrences matching quantifier) asterisks
(?:[^/*][^*]*\*+)* - 0+ sequences (non-capturing, we do not use it later) of any character but a / or * (see [^/*]), followed with 0+ characters other than an asterisk (see [^*]*) and then followed with 1+ asterisks (see \*+).
/ - a literal (trailing, closing) slash.
NOTE: This multiline comment regex is the fastest I have ever tested. Same goes for the double quoted literal regex as "[^"\\]*(?:\\.[^"\\]*)*" is written with the unroll-the-loop technique in mind: no alternations, only character classes with * and + quantifiers are used in a specific order to allow the fastest matching.
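The capture-what-you-keep trick carries over to other engines unchanged. Here is the same pattern in a Python sketch, where a replacement function stands in for the "$1" template; the function name remove_comments is made up.

```python
import re

# Alternative 1 (captured): a double-quoted literal that must be kept.
# Alternative 2 (not captured): a /* ... */ comment that must be dropped.
PATTERN = re.compile(r'("[^"\\]*(?:\\.[^"\\]*)*")|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/')


def remove_comments(text):
    # Group 1 is None when the comment alternative matched, so the match
    # is replaced with nothing; otherwise the literal is written back.
    return PATTERN.sub(lambda m: m.group(1) or "", text)


txt = (
    'random text\n'
    '/* removable comment */\n'
    '"but /* never remove this */ one"\n'
    'more random *text*'
)
cleaned = remove_comments(txt)
```

The line that held the comment stays behind as an empty line, which is the formatting issue the later variants in this thread try to address.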
NOTES ON PATTERN ENHANCEMENTS
If you plan to extend to matching single quoted literals, there is nothing easier, just add another alternative into the 1st capture group by re-using the double quoted string literal regex and replacing the double quotes with single ones:
reg := regexp.MustCompile(`("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/`)
^-------------------------^
Here is the single- and double-quoted literal supporting regex demo removing the multiline comments
Adding a single line comment support is similar: just add //[^\n\r]* alternative to the end:
reg := regexp.MustCompile(`("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//.*[\r\n]*`)
^-----------^
Here is the single- and double-quoted literal supporting regex demo removing the multiline and single-line comments
These do not preserve formatting
Preferred way (produces a NULL if group 1 is not matched)
works in golang playground -
# https://play.golang.org/p/yKtPk5QCQV
# fmt.Println(reg.ReplaceAllString(txt, "$1"))
# (?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//[^\n]*(?:\n|$))|("[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|[\S\s][^/"'\\]*)
(?: # Comments
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
|
// [^\n]* # Start // comment
(?: \n | $ ) # End // comment
)
|
( # (1 start), Non - comments
"
[^"\\]* # Double quoted text
(?: \\ [\S\s] [^"\\]* )*
"
|
'
[^'\\]* # Single quoted text
(?: \\ [\S\s] [^'\\]* )*
'
| [\S\s] # Any other char
[^/"'\\]* # Chars which don't start a comment, string, escape, or line continuation (escape + newline)
) # (1 end)
Alternative way (group 1 is always matched, but could be empty)
works in golang playground -
# https://play.golang.org/p/7FDGZSmMtP
# fmt.Println(reg.ReplaceAllString(txt, "$1"))
# (?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//[^\n]*(?:\n|$))?((?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|[\S\s][^/"'\\]*)?)
(?: # Comments
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
|
// [^\n]* # Start // comment
(?: \n | $ ) # End // comment
)?
( # (1 start), Non - comments
(?:
"
[^"\\]* # Double quoted text
(?: \\ [\S\s] [^"\\]* )*
"
|
'
[^'\\]* # Single quoted text
(?: \\ [\S\s] [^'\\]* )*
'
| [\S\s] # Any other char
[^/"'\\]* # Chars which don't start a comment, string, escape, or line continuation (escape + newline)
)?
) # (1 end)
The Cadillac - Preserves Formatting
(Unfortunately, this can't be done in Golang because Golang's regex engine does not support lookaround assertions)
Posted in case you move to a different regex engine.
# raw: ((?:(?:^[ \t]*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|/\*|//)))?|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|/\*|//))|(?=\r?\n))))+)|("[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?:\r?\n|[\S\s])[^/"'\\\s]*)
# delimited: /((?:(?:^[ \t]*)?(?:\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/)))?|\/\/(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/))|(?=\r?\n))))+)|("[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?:\r?\n|[\S\s])[^\/"'\\\s]*)/
( # (1 start), Comments
(?:
(?: ^ [ \t]* )? # <- To preserve formatting
(?:
/\* # Start /* .. */ comment
[^*]* \*+
(?: [^/*] [^*]* \*+ )*
/ # End /* .. */ comment
(?: # <- To preserve formatting
[ \t]* \r? \n
(?=
[ \t]*
(?: \r? \n | /\* | // )
)
)?
|
// # Start // comment
(?: # Possible line-continuation
[^\\]
| \\
(?: \r? \n )?
)*?
(?: # End // comment
\r? \n
(?= # <- To preserve formatting
[ \t]*
(?: \r? \n | /\* | // )
)
| (?= \r? \n )
)
)
)+ # Grab multiple comment blocks if need be
) # (1 end)
| ## OR
( # (2 start), Non - comments
"
[^"\\]* # Double quoted text
(?: \\ [\S\s] [^"\\]* )*
"
|
'
[^'\\]* # Single quoted text
(?: \\ [\S\s] [^'\\]* )*
'
|
(?: \r? \n | [\S\s] ) # Linebreak or Any other char
[^/"'\\\s]* # Chars which don't start a comment, string, escape,
# or line continuation (escape + newline)
) # (2 end)
I've never read/written anything in Go, so bear with me. Fortunately, I know regex. I did a little research on Go regexes, and it would seem that they lack most modern features (such as references).
Despite that, I've developed a regex that seems to be what you're looking for. I'm assuming that all strings are single line. Here it is:
reg := regexp.MustCompile(`(?m)^([^"\n]*)/\*([^*]+|(\*+[^/]))*\*+/`)
txt := `random text
/* removable comment */
"but /* never remove this */ one"
more random *text*`
fmt.Println(reg.ReplaceAllString(txt, "${1}"))
Variation: The version above will not remove comments that happen after quotation marks. This version will, but it may need to be run multiple times.
reg := regexp.MustCompile(
`(?m)^(([^"\n]*|("[^"\n]*"))*)/\*([^*]+|(\*+[^/]))*\*+/`
)
txt := `
random text
what /* removable comment */
hi "but /* never remove this */ one" then /*whats here*/ i don't know /*what*/
more random *text*
`
newtxt := reg.ReplaceAllString(txt, "${1}")
fmt.Println(newtxt)
newtxt = reg.ReplaceAllString(newtxt, "${1}")
fmt.Println(newtxt)
Explanation
(?m) means multiline mode. Regex101 gives a nice explanation of this:
The ^ and $ anchors now match at the beginning/end of each line respectively, instead of beginning/end of the entire string.
It needs to be anchored to the beginning of each line (with ^) to ensure a quote hasn't started.
The first regex has this: [^"\n]*. Essentially, it's matching everything that's not " or \n. I've added parenthesis because this stuff isn't comments, so it needs to be put back.
The second regex has this: (([^"\n]*|("[^"\n]*"))*). The regex, with this statement can either match [^"\n]* (like the first regex does), or (|) it can match a pair of quotes (and the content between them) with "[^"\n]*". It's repeating so it works when there are more than one quote pair, for example. Note that, like the simpler regex, this non-comment stuff is being captured.
Both regexes use this: /\*([^*]+|(\*+[^/]))*\*+/. It matches /* followed by any amount of either:
[^*]+ Non * chars
or
\*+[^/] One or more *s that are not followed by /.
And then it matches the closing */
During replacement, the ${1} refers to the non-comment things that were captured, so they're reinserted into the string.
Just for fun, another approach: a minimal lexer implemented as a state machine, inspired by and well described in Rob Pike's talk http://cuddle.googlecode.com/hg/talk/lex.html. The code is more verbose but more readable, understandable and hackable than a regexp. It can also work with any Reader and Writer, not only strings, so it doesn't need to hold the whole input in RAM and should even be faster.
type stateFn func(*lexer) stateFn
func run(l *lexer) {
for state := lexText; state != nil; {
state = state(l)
}
}
type lexer struct {
io.RuneReader
io.Writer
}
func lexText(l *lexer) stateFn {
for r, _, err := l.ReadRune(); err != io.EOF; r, _, err = l.ReadRune() {
switch r {
case '"':
l.Write([]byte(string(r)))
return lexQuoted
case '/':
r, _, err = l.ReadRune()
if r == '*' {
return lexComment
} else {
l.Write([]byte("/"))
l.Write([]byte(string(r)))
}
default:
l.Write([]byte(string(r)))
}
}
return nil
}
func lexQuoted(l *lexer) stateFn {
for r, _, err := l.ReadRune(); err != io.EOF; r, _, err = l.ReadRune() {
if r == '"' {
l.Write([]byte(string(r)))
return lexText
}
l.Write([]byte(string(r)))
}
return nil
}
func lexComment(l *lexer) stateFn {
for r, _, err := l.ReadRune(); err != io.EOF; r, _, err = l.ReadRune() {
if r == '*' {
r, _, err = l.ReadRune()
if r == '/' {
return lexText
}
}
}
return nil
}
You can see it works http://play.golang.org/p/HyvEeANs1u
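The three states (text, quoted, comment) can also be written as a single loop. The Python sketch below works on an in-memory string instead of a Reader/Writer pair, and it additionally skips backslash-escaped characters inside quotes, which the Go code above does not do; both choices are assumptions made for this illustration.

```python
def strip_comments(src):
    """Remove /* ... */ comments that are not inside double-quoted strings."""
    out = []
    state = "text"          # one of: "text", "quoted", "comment"
    i, n = 0, len(src)
    while i < n:
        ch = src[i]
        if state == "text":
            if ch == '"':
                out.append(ch)
                state = "quoted"
                i += 1
            elif src.startswith("/*", i):
                state = "comment"
                i += 2
            else:
                out.append(ch)
                i += 1
        elif state == "quoted":
            out.append(ch)
            if ch == "\\" and i + 1 < n:
                # Copy the escaped character so an escaped quote
                # does not end the string.
                out.append(src[i + 1])
                i += 2
                continue
            if ch == '"':
                state = "text"
            i += 1
        else:  # inside a comment: drop everything up to and including */
            if src.startswith("*/", i):
                state = "text"
                i += 2
            else:
                i += 1
    return "".join(out)


txt = (
    'random text\n'
    '/* removable comment */\n'
    '"but /* never remove this */ one"\n'
    'more random *text*'
)
stripped = strip_comments(txt)
```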
Demo
Play golang demo
(The workings at each stage are output and the end result can be seen by scrolling down.)
Method
A few "tricks" are used to work around Golang's somewhat limited regex syntax:
1. Replace start quotes and end quotes with a unique character. Crucially, the characters used to identify start and end quotes must be different from each other and extremely unlikely to appear in the text being processed.
2. Replace all comment starters (/*) that are not preceded by an unterminated start quote with a unique sequence of one or more characters.
3. Similarly, replace all comment enders (*/) that are not succeeded by an end quote that does not have a start quote before it with a different unique sequence of one or more characters.
4. Remove all remaining /*...*/ comment sequences.
5. Unmask the previously "masked" comment starters/enders by reversing the replacements made in steps 2 and 3 above.
Limitations
The current demo doesn't address the possibility of a double quote appearing within a comment, e.g. /* Not expected: " */. Note: My feeling is this could be handled - just haven't put the effort in yet - so let me know if you think it could be an issue and I'll look into it.
Try this example..
play golang

Regex to match specific functions and their arguments in files

I'm working on a gettext javascript parser and I'm stuck on the parsing regex.
I need to catch every argument passed to a specific method call _n( and _(. For example, if I have these in my javascript files:
_("foo") // want "foo"
_n("bar", "baz", 42); // want "bar", "baz", 42
_n(domain, "bux", var); // want domain, "bux", var
_( "one (optional)" ); // want "one (optional)"
apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) // could have on the same line two calls..
This refs this documentation: http://poedit.net/trac/wiki/Doc/Keywords
I'm planning to do it in two passes (with two regexes):
catch all function arguments for _n( or _( method calls
catch the stringy ones only
Basically, I'd like a regex that could say "catch everything after _n( or _( and stop at the last parenthesis )", that is, when the function call is done. I don't know if this is possible with a regex and without a JavaScript parser.
What could also be done is "catch every "string" or 'string' after _n( or _( and stop at the end of the line OR at the beginning of a new _n( or _( call".
In everything I've done I get either stuck on _( "one (optional)" ); with its inside parenthesis or apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) with two calls on the same line.
Here is what I implemented so far, with imperfect regexes: a generic parser and the javascript one or the handlebars one
Note: Read this answer if you're not familiar with recursion.
Part 1: match specific functions
Who said that regex can't be modular? Well PCRE regex to the rescue!
~ # Delimiter
(?(DEFINE) # Start of definitions
(?P<str_double_quotes>
(?<!\\) # Not escaped
" # Match a double quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
" # Match the ending double quote
)
(?P<str_single_quotes>
(?<!\\) # Not escaped
' # Match a single quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
' # Match the ending single quote
)
(?P<brackets>
\( # Match an opening bracket
(?: # A non capturing group
(?&str_double_quotes) # Recurse/use the str_double_quotes pattern
| # Or
(?&str_single_quotes) # Recurse/use the str_single_quotes pattern
| # Or
[^()] # Anything not a bracket
| # Or
(?&brackets) # Recurse the bracket pattern
)*
\)
)
) # End of definitions
# Let's start matching for real now:
_n? # Match _ or _n
\s* # Optional white spaces
(?P<results>(?&brackets)) # Recurse/use the brackets pattern and put it in the results group
~sx
The s is for matching newlines with . and the x modifier is for this fancy spacing and commenting of our regex.
Online regex demo
Online php demo
Part 2: getting rid of opening & closing brackets
Since our regex will also get the opening and closing brackets (), we might need to filter them. We will use preg_replace() on the results:
~ # Delimiter
^ # Assert begin of string
\( # Match an opening bracket
\s* # Match optional whitespaces
| # Or
\s* # Match optional whitespaces
\) # Match a closing bracket
$ # Assert end of string
~x
Online php demo
Part 3: extracting the arguments
So here's another modular regex, you could even add your own grammar:
~ # Delimiter
(?(DEFINE) # Start of definitions
(?P<str_double_quotes>
(?<!\\) # Not escaped
" # Match a double quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
" # Match the ending double quote
)
(?P<str_single_quotes>
(?<!\\) # Not escaped
' # Match a single quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
' # Match the ending single quote
)
(?P<array>
Array\s*
(?&brackets)
)
(?P<variable>
[^\s,()]+ # I don't know the exact grammar for a variable in ECMAScript
)
(?P<brackets>
\( # Match an opening bracket
(?: # A non capturing group
(?&str_double_quotes) # Recurse/use the str_double_quotes pattern
| # Or
(?&str_single_quotes) # Recurse/use the str_single_quotes pattern
| # Or
(?&array) # Recurse/use the array pattern
| # Or
(?&variable) # Recurse/use the variable pattern
| # Or
[^()] # Anything not a bracket
| # Or
(?&brackets) # Recurse the bracket pattern
)*
\)
)
) # End of definitions
# Let's start matching for real now:
(?&array)
|
(?&variable)
|
(?&str_double_quotes)
|
(?&str_single_quotes)
~xis
We will loop and use preg_match_all(). The final code would look like this:
$functionPattern = <<<'regex'
~ # Delimiter
(?(DEFINE) # Start of definitions
(?P<str_double_quotes>
(?<!\\) # Not escaped
" # Match a double quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
" # Match the ending double quote
)
(?P<str_single_quotes>
(?<!\\) # Not escaped
' # Match a single quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
' # Match the ending single quote
)
(?P<brackets>
\( # Match an opening bracket
(?: # A non capturing group
(?&str_double_quotes) # Recurse/use the str_double_quotes pattern
| # Or
(?&str_single_quotes) # Recurse/use the str_single_quotes pattern
| # Or
[^()] # Anything not a bracket
| # Or
(?&brackets) # Recurse the bracket pattern
)*
\)
)
) # End of definitions
# Let's start matching for real now:
_n? # Match _ or _n
\s* # Optional white spaces
(?P<results>(?&brackets)) # Recurse/use the brackets pattern and put it in the results group
~sx
regex;
$argumentsPattern = <<<'regex'
~ # Delimiter
(?(DEFINE) # Start of definitions
(?P<str_double_quotes>
(?<!\\) # Not escaped
" # Match a double quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
" # Match the ending double quote
)
(?P<str_single_quotes>
(?<!\\) # Not escaped
' # Match a single quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
' # Match the ending single quote
)
(?P<array>
Array\s*
(?&brackets)
)
(?P<variable>
[^\s,()]+ # I don't know the exact grammar for a variable in ECMAScript
)
(?P<brackets>
\( # Match an opening bracket
(?: # A non capturing group
(?&str_double_quotes) # Recurse/use the str_double_quotes pattern
| # Or
(?&str_single_quotes) # Recurse/use the str_single_quotes pattern
| # Or
(?&array) # Recurse/use the array pattern
| # Or
(?&variable) # Recurse/use the variable pattern
| # Or
[^()] # Anything not a bracket
| # Or
(?&brackets) # Recurse the bracket pattern
)*
\)
)
) # End of definitions
# Let's start matching for real now:
(?&array)
|
(?&str_double_quotes)
|
(?&str_single_quotes)
|
(?&variable)
~six
regex;
$input = <<<'input'
_ ("foo") // want "foo"
_n("bar", "baz", 42); // want "bar", "baz", 42
_n(domain, "bux", var); // want domain, "bux", var
_( "one (optional)" ); // want "one (optional)"
apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) // could have on the same line two calls..
// misleading cases
_n("foo (")
_n("foo (\)", 'foo)', aa)
_n( Array(1, 2, 3), Array(")", '(') );
_n(function(foo){return foo*2;}); // Is this even valid?
_n (); // Empty
_ (
"Foo",
'Bar',
Array(
"wow",
"much",
'whitespaces'
),
multiline
); // PCRE is awesome
input;
if(preg_match_all($functionPattern, $input, $m)){
$filtered = preg_replace(
'~ # Delimiter
^ # Assert begin of string
\( # Match an opening bracket
\s* # Match optional whitespaces
| # Or
\s* # Match optional whitespaces
\) # Match a closing bracket
$ # Assert end of string
~x', // Regex
'', // Replace with nothing
$m['results'] // Subject
); // Getting rid of opening & closing brackets
// Part 3: extract arguments:
$parsedTree = array();
foreach($filtered as $arguments){ // Loop
if(preg_match_all($argumentsPattern, $arguments, $m)){ // If there's a match
$parsedTree[] = array(
'all_arguments' => $arguments,
'branches' => $m[0]
); // Add an array to our tree and fill it
}else{
$parsedTree[] = array(
'all_arguments' => $arguments,
'branches' => array()
); // Add an array with empty branches
}
}
print_r($parsedTree); // Let's see the results;
}else{
echo 'no matches';
}
Online php demo
You might want to create a recursive function to generate a full tree. See this answer.
You might notice that the function(){} part isn't parsed correctly. I will leave that as an exercise for the readers :)
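Recursion and (?(DEFINE)...) are PCRE features. In an engine without them (Python's built-in re, for example) the argument-splitting step from Part 3 can be approximated with a small hand-written scanner. This is a sketch of that swapped-in technique, not a port of the regex; the name split_args is made up.

```python
def split_args(args):
    """Split an argument list on commas that are outside quotes and parentheses."""
    parts, current = [], []
    depth = 0        # nesting level of parentheses
    quote = None     # the quote character we are inside, if any
    i = 0
    while i < len(args):
        ch = args[i]
        if quote:
            current.append(ch)
            if ch == "\\" and i + 1 < len(args):
                i += 1
                current.append(args[i])   # keep the escaped character as-is
            elif ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
            current.append(ch)
        elif ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            current.append(ch)
        i += 1
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts
```

It takes the already-unwrapped text between the outer parentheses, so it would be applied to each entry of the filtered results from Part 2.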
Try this:
(?<=\().*?(?=\s*\)[^)]*$)
See live demo
Below regex should help you.
^(?=\w+\()\w+?\(([\s'!\\\)",\w]+)+\);
Check the demo here
\(( |"(\\"|[^"])*"|'(\\'|[^'])*'|[^)"'])*?\)
This should get anything between a pair of parenthesis, ignoring parenthesis in quotes.
Explanation:
\( // Literal open paren
(
| //Space or
"(\\"|[^"])*"| //Anything between two double quotes, including escaped quotes, or
'(\\'|[^'])*'| //Anything between two single quotes, including escaped quotes, or
[^)"'] //Any character that isn't a quote or close paren
)*? // All that, as many times as necessary
\) // Literal close paren
No matter how you slice it, regular expressions are going to cause problems. They're hard to read, hard to maintain, and highly inefficient. I'm unfamiliar with gettext, but perhaps you could use a for loop?
// This is just pseudocode. A loop like this can be more readable, maintainable, and predictable than a regular expression.
for(int i = 0; i < input.length; i++) {
// Ignoring anything that isn't an opening paren
if(input[i] == '(') {
String capturedText = "";
// Loop until a close paren is reached, or an EOF is reached
for(; input[i] != ')' && i < input.length; i++) {
if(input[i] == '"') {
// Loop until an unescaped close quote is reached, or an EOF is reached
for(; (input[i] != '"' || input[i - 1] == '\\') && i < input.length; i++) {
capturedText += input[i];
}
}
if(input[i] == "'") {
// Loop until an unescaped close quote is reached, or an EOF is reached
for(; (input[i] != "'" || input[i - 1] == '\\') && i < input.length; i++) {
capturedText += input[i];
}
}
capturedText += input[i];
}
capture(capturedText);
}
}
Note: I didn't cover how to determine if it's a function or just a grouping symbol. (ie, this will match a = (b * c)). That's complicated, as is covered in detail here. As your code gets more and more accurate, you get closer and closer to writing your own javascript parser. You might want to take a look at the source code for actual javascript parsers if you need that sort of accuracy.
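A runnable take on the loop idea, sketched in Python. It differs from the pseudocode in two ways that are assumptions of this sketch: it only starts at _( or _n( (so plain grouping parentheses are ignored), and it counts nested parentheses. The names CALL_START and extract_call_args are made up.

```python
import re

# "_" or "_n", optional whitespace, then the opening parenthesis.
CALL_START = re.compile(r"\b_n?\s*\(")


def extract_call_args(src):
    """Return the raw argument text of every _( ... ) and _n( ... ) call."""
    results = []
    pos = 0
    while True:
        m = CALL_START.search(src, pos)
        if m is None:
            break
        start = i = m.end()      # first character after the "("
        depth = 1
        quote = None
        while i < len(src) and depth > 0:
            ch = src[i]
            if quote:
                if ch == "\\":
                    i += 1       # skip the escaped character
                elif ch == quote:
                    quote = None
            elif ch in "\"'":
                quote = ch
            elif ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            i += 1
        if depth == 0:
            results.append(src[start:i - 1].strip())
        pos = i                  # continue after this call (or at end of input)
    return results


line = 'apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples)'
calls = extract_call_args(line)
```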
One bit of code (you can test this PHP code at http://writecodeonline.com/php/ to check):
$string = '_("foo")
_n("bar", "baz", 42);
_n(domain, "bux", var);
_( "one (optional)" );
apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples)';
preg_match_all('/(?<=(_\()|(_n\())[\w", ()%]+(?=\))/i', $string, $matches);
foreach($matches[0] as $test){
$opArr = explode(',', $test);
foreach($opArr as $test2){
echo trim($test2) . "\n";
}
}
you can see the initial pattern and how it works here: http://regex101.com/r/fR7eU2/1
Output is:
"foo"
"bar"
"baz"
42
domain
"bux"
var
"one (optional)"
"No apples"
"%1 apple"
"%1 apples"
apples
We can do this in two steps:
1)catch all function arguments for _n( or _( method calls
(?:_\(|_n\()(?:[^()]*\([^()]*\))*[^()]*\)
See demo.
http://regex101.com/r/oE6jJ1/13
2)catch the stringy ones only
"([^"]*)"|(?:\(|,)\s*([^"),]*)(?=,|\))
See demo.
http://regex101.com/r/oE6jJ1/14
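Chained together, the two patterns can be exercised like this in Python (a sketch; the function name call_arguments is made up). Note that the second pattern captures quoted strings without their quotes, so after this step a string argument and a bare identifier look the same.

```python
import re

# Step 1: a whole _( ... ) or _n( ... ) call, allowing one level of nested parentheses.
CALL = re.compile(r'(?:_\(|_n\()(?:[^()]*\([^()]*\))*[^()]*\)')
# Step 2: either a double-quoted string (group 1) or a bare argument (group 2).
ARG = re.compile(r'"([^"]*)"|(?:\(|,)\s*([^"),]*)(?=,|\))')


def call_arguments(source):
    """Return one list of arguments per matched call."""
    calls = []
    for call in CALL.findall(source):
        calls.append([quoted or bare for quoted, bare in ARG.findall(call)])
    return calls
```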