Regular expression to match strings for syntax highlighter - regex

I'm looking for a regular expression that matches strings for a syntax highlighter used in a code editor. I've found
(")(?:(?!\1|\\).|\\.)*\1
from here regex-grabbing-values-between-quotation-marks (I've changed the beginning since I only need double quotes, no single quotes)
The above regular expression correctly matches the following example having escaped double quotes and escaped backslashes
"this is \" just a test\\"
Most code editors however also highlight open ended strings such as the following example
"this must \" match\\" this text must not be matched "this text must be matched as well
Is it possible to alter the above regular expression to also match the open ended string? Another possibility would be a second regular expression that just matches the open ended string such as
"[^"]*$ but match only if preceded by an even count of non-escaped quotes

You could use an alternation to match either a backreference to group 1 or assert the end of the string with your current pattern.
(")(?:(?!\1|\\).|\\.)*(?:\1|$)
But as you are only capturing a single character (") you can omit the capture group and instead of the backreference \1 just match "
Alternatively written pattern:
"[^"\\]*(?:\\.[^"\\]*)*(?:"|$)
If the match should not start with \" and a lookbehind is supported:
(?<!\\)"[^"\\]*(?:\\.[^"\\]*)*(?:"|$)
This pattern matches:
(?<!\\) Negative lookbehind, assert not \ directly to the left
" Match the double quote
[^"\\]* Optionally match any char except " or \
(?:\\.[^"\\]*)* Optionally repeat matching \ and any char followed by any char except " or \
(?:"|$) Match either " or assert the end of the string.
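To try the last pattern against the open-ended example from the question, a minimal JavaScript sketch could look like the following. The variable names are illustrative. It assumes an engine with lookbehind support and a highlighter that feeds the regex one line at a time (without the m flag, $ only matches at the very end of the input).

```javascript
// Pattern from the answer above: a closed string, or an open-ended one
// that runs to the end of the input.
const stringLiteral = /(?<!\\)"[^"\\]*(?:\\.[^"\\]*)*(?:"|$)/g;

// String.raw keeps the backslashes exactly as typed.
const editorLine = String.raw`"this must \" match\\" this text must not be matched "this text must be matched as well`;

const literalMatches = editorLine.match(stringLiteral);

// First match:  "this must \" match\\"
// Second match: "this text must be matched as well
literalMatches.forEach((m) => console.log(m));
```

The text between the two strings is skipped because every match has to start at an unescaped double quote.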

Related

Why doesn't the regular expression below match the string?

I am trying to match the string below with a regular expression.
string : PKGx.1234 ... BBA
Regular expression : ^\bPKG[0-9]{0,1}.[0-9]{0,4}\ ...\ \bBB[A-B]{1}?$
But I get no match.
Can anyone help me modify the regular expression so that it matches the given string?
The string has an x after PKG, but the pattern tries to match that position with an optional digit, [0-9]{0,1}. If an optional single lowercase character is allowed there, you can use [a-z]? instead.
You can omit the word boundary before the BB, as there is an implicit word boundary between the space and the B.
Note that you don't have to escape the spaces, but you do have to escape the dot to match it literally.
^PKG[a-z]?\.[0-9]{0,4} \.{3} BB[A-B]\b
If you want to match the whole string, including a trailing space and comma, and use the $ anchor to assert the end of the string:
^PKG[a-z]?\.[0-9]{0,4} \.{3} BB[A-B] , *$
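As a quick check, here is a small JavaScript sketch that runs both patterns. The test string with a trailing space and comma is an assumption based on the description above, and the variable names are illustrative.

```javascript
// Optional lowercase letter after PKG, escaped dots, unescaped spaces.
const pkgPattern = /^PKG[a-z]?\.[0-9]{0,4} \.{3} BB[A-B]\b/;
// Variant anchored with $ that also expects " ," (plus optional spaces) at the end.
const pkgWholeLine = /^PKG[a-z]?\.[0-9]{0,4} \.{3} BB[A-B] , *$/;

console.log(pkgPattern.test("PKGx.1234 ... BBA"));      // true
console.log(pkgPattern.test("PKGx91234 ... BBA"));      // false: \. only matches a literal dot
console.log(pkgWholeLine.test("PKGx.1234 ... BBA ,"));  // true
console.log(pkgWholeLine.test("PKGx.1234 ... BBA"));    // false: the trailing " ," is required
```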

Trying to match string A if string B is found anywhere before it

What I'm trying to do is, if a string consists of some substring that starts with "!" encapsulated in "[" and "]", to separate those brackets from the rest of the string via a space, e.g. "[!foo]" --> "[ !foo ]", "[!bar]" --> "[ !bar ]", etc. Since that substring can be variable length, I figured this had to be done with regex. My thought was to do this in two steps - first separate the first bracket, then separate the second bracket.
The first one isn't hard; the regex is just \[! and so I can just do str = str.replace(/\[!/g, "[ !"); in Javascript. It's the second part I can't get to work.
Because now, I need to match "]" if the string literal "[ !" is found anywhere before it. So a simple positive lookbehind doesn't match because it only looks directly behind: (?<=\Q[ !\E)\] doesn't match.
And I still don't understand why, but I'm not allowed to make the positive lookbehind non-fixed length; (?<=\Q[ !\E.*)\] throws the error Syntax Error: Invalid regular expression: missing / in the console, and this regex debugger yields a pattern error explaining "A quantifier inside a lookbehind makes it non-fixed width".
Putting a non-capturing group of non-fixed width between the lookbehind and the capturing group doesn't work; (?<=\Q[ !\E)(?:.*)\] doesn't match.
One thing that won't work is just trying to match "[ !" at the start of the string, because this whole "[!foo]" string is actually itself a substring of an even bigger string and isn't at the beginning.
What am I missing?
Using two positive lookarounds, you can assert that what is directly to the left is an opening square bracket with (?<=\[)
Then match an exclamation mark followed by one or more characters other than [ or ] using the negated character class ![^[\]]+ and assert that what is directly to the right is a closing square bracket using (?=])
Note that lookbehind was only added to JavaScript in ES2018, so older engines do not support it.
(?<=\[)![^[\]]+(?=])
In the replacement use the matched substring $&
[
"[!foo]",
"[!bar]"
].forEach(s =>
console.log(s.replace(/(?<=\[)![^[\]]+(?=])/g, " $& "))
)
Or you could also use 3 capturing groups instead:
(\[)(![^\]]+)(\])
In the replacement use
$1 $2 $3
[
"[!foo]",
"[!bar]"
].forEach(s =>
console.log(s.replace(/(\[)(![^\]]+)(\])/g, "$1 $2 $3"))
)
You can use this regex: \[!([^\]]+)\] with this substitution string [ !$1 ] (in JavaScript the ] inside the character class must be escaped, and the group reference is written $1 rather than \1).
Explanation:
The regex:
\[!: match begins with [!
([^\]]+): capture in group 1 all the characters that are not ]
\]: match ]
The substitution: substitute the full match with [ !{contents of group 1} ].
I hope it helps.
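For JavaScript specifically, here is a hedged sketch of this capture-group approach. The ] inside the character class is escaped (an unescaped [^] is a complete class of its own in JavaScript), the group is referenced as $1, and the replacement puts the space before the ! so the output has the form asked for in the question. The sample sentence is made up for illustration.

```javascript
// \[!       literal "[!"
// ([^\]]+)  group 1: one or more characters that are not "]"
// \]        literal "]"
const bangBrackets = /\[!([^\]]+)\]/g;

const sampleText = "keep [this] but pad [!foo] and [!bar]";
const paddedText = sampleText.replace(bangBrackets, "[ !$1 ]");

console.log(paddedText); // keep [this] but pad [ !foo ] and [ !bar ]
```

Because the whole bracketed token is matched in one go, no lookbehind is needed and it does not matter where in the larger string the token appears.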

replace single-quote with double-quote, if and only if quote is after specific string

I'm working in notepad++, and using its find-replace dialog box.
NP++ documentation states: Notepad++ regular expressions use the Boost regular expression library v1.70, which is based on PCRE (Perl Compatible Regular Expression) syntax. ref: https://npp-user-manual.org/docs/searching
What I'm trying to do should be simple, but I'm a regex novice, and after 2-3 hrs of web searches and playing with online regex testers, I give up.
I want to replace all single quotes ' with double quotes " , but if and only if the ' is to the RIGHT of one or more #, i.e. inside a Python comment.
For example,
list1 = ['apple','banana','pear'] # All 'single quotes' to LEFT of # remained unchanged.
list2 = ['tomato','carrot'] # All 'single quotes' to RIGHT of one or more # are replaced
# # with "double quotes", like this.
The np++ file is over 800 lines, manual replacement would be tedious & error prone. Advice appreciated.
This regex should do what you want:
(^[^#]*#|(?<!^)\G)[^'\n]*\K'
It looks for a ' which is preceded by either
^[^#]*# : start of line and some number of non-# characters followed by a #; or
(?<!^)\G : \G matches at the start of the input or at the end of the previous match; the negative lookbehind for start of line (?<!^) rules out the first case, so this alternative only matches at the end of the previous match
and then some number of non ' or newline (to prevent the match wrapping around the end of the previous line) characters [^'\n]*.
We then use \K to reset the match, so that everything before that is discarded from the match, and the regex only matches the '.
That can then be replaced with ".
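\K and \G are not available in every regex flavor. Outside Notepad++, for example in a JavaScript script, a different technique gives the same result for this simple case: JavaScript allows a variable-length lookbehind, so a ' can be required to have a # somewhere before it on the same line. This is a sketch under that assumption, not the Boost pattern above. Like that first pattern, it treats any # as the start of a comment. The sample lines are illustrative.

```javascript
// (?<=#.*)  somewhere to the left, on the same line, there is a "#"
//           ("." does not match newlines, so each line is judged on its own)
const quoteInComment = /(?<=#.*)'/g;

const pySource = [
  "list1 = ['apple','banana','pear']  # All 'single quotes' here",
  "list2 = ['tomato','carrot']",
].join("\n");

const convertedSource = pySource.replace(quoteInComment, '"');
console.log(convertedSource);
// list1 = ['apple','banana','pear']  # All "single quotes" here
// list2 = ['tomato','carrot']
```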
Update
You can avoid matching apostrophes within words by only matching ones that are either preceded or followed by a non-word character:
(^[^#]*#|(?<!^)\G)[^'\n]*\K('(?=\W)|(?<=\W)')
Update 2
You can also deal with the case where there are # characters in strings by qualifying the first part of the regex with the requirement for there to be matched pairs of quotes beforehand:
(?:^[^'#]*(?:'[^']*'[^#']*)*[^'#]*#|(?<!^)\G)[^'\n]*\K(?:'(?=\W)|(?<=\W)')

Regular expression that matches between quotes

I want to write a regular expression that matches the string inside quotes, without stopping at the escaped quotes within it.
For example:
My string:
"Good programm\",\"pls help"
I want to get:
Good programm\",\"pls help
Try (?<=").*(?=") check online: http://regexr.com?349d2
As long as you don't have nested structures you can try this:
(?<=")(?:[^"]|(?<=\\)")*(?=")
See it here on Regexr
(?<=") positive lookbehind assertion, ensures there is a " before the match (Try if it is working for you, in Regexr it is.)
(?:[^"]|(?<=\\)") Alternation: matches either a character that is not a ", or a " that is escaped (ensured by the lookbehind (?<=\\)).
* The alternation is matched 0 or more times.
(?=") positive lookahead assertion, ensures there is a " after the match
But be careful: it matches across newlines, and also between escaped " when no non-escaped quotes are available.
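Applied to the example from the question, the lookaround pattern can be checked with a short JavaScript sketch (assuming an engine with lookbehind support; the variable names are illustrative):

```javascript
// (?<=")               a " directly before the match
// (?:[^"]|(?<=\\)")*   any non-quote character, or a quote preceded by a backslash
// (?=")                a " directly after the match
const betweenQuotes = /(?<=")(?:[^"]|(?<=\\)")*(?=")/;

// String.raw keeps the backslashes exactly as typed.
const quotedInput = String.raw`"Good programm\",\"pls help"`;

const innerText = quotedInput.match(betweenQuotes)[0];
console.log(innerText); // Good programm\",\"pls help
```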

Regex for matching literal strings

I'm trying to write a regular expression which will match a string. For simplicity, I'm only concerned with double quote (") strings for the moment.
So far I have this: "\"[^\"]*\""
This works for most strings but fails when there is an escaped double quote such as this:
"a string \" with an escaped quote"
In this case, it only matches up to the escaped quote.
I've tried several things to allow an escaped quote but so far I've been unsuccessful, can anyone give me a hand?
I've managed to solve it myself:
"\"(\\.|[^\"\\])*\""
Try this:
"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*"
If you want a multi-line escaped string you can use:
"[^"\\]*(?:\\.[^"\\]*)*"
You need a negative lookbehind. Check if this works?
"\"[^\"]*(?<!\\)"
(?<!\\)" is supposed to match a " that's not preceded by \.
Try:
"((\\")|[^"(\\")])+"
From Regular Expression Library.
Usually you want to accept escaped anything.
" [^"\\]* (?: \\. [^"\\]* )* " would be the fastest.
"[^"\\]*(?:\\.[^"\\]*)*" compressed.
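To see the difference between the first attempt from the question and the unrolled pattern, here is a small JavaScript sketch. The text surrounding the string literal in the sample is made up for illustration.

```javascript
const naiveString = /"[^"]*"/;                       // stops at the first ", escaped or not
const unrolledString = /"[^"\\]*(?:\\.[^"\\]*)*"/;   // treats "\" plus any character as one unit

// String.raw keeps the backslash exactly as typed.
const codeLine = String.raw`print("a string \" with an escaped quote") // done`;

const naiveResult = codeLine.match(naiveString)[0];
const unrolledResult = codeLine.match(unrolledString)[0];

console.log(naiveResult);    // "a string \"
console.log(unrolledResult); // "a string \" with an escaped quote"
```

Neither pattern uses lookaround, so the unrolled form also works in flavors that lack lookbehind.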
POSIX does not, AFAIK, support lookaround - without it, there is really no way to do this with just regular expressions. However, according to a POSIX emulator I have (no access to a native environment or library), This might get you close, in certain cases:
"[^\"]*"|"[^\]*\\|\\[^\"]*[\"]
It will capture the part before and the part after the escaped quote. With this source string (ignore the line breaks, and imagine it's all in one string):
I want to match "this text" and "This text, where there is an escaped
slash (\\), and an \"escaped quote\" (\")", but I also want to handle\\ escaped
back-slashes, as in "this text, with a \\ backslash: \\" -- with a little
text behind it!
it will capture these groups:
"this text" -- simple, quoted string
"This text, where there is an escaped slash (\ -- part 1 of quoted string
\), and an \ -- part 2
"escaped quote\ -- part 3
" (\ -- part 4
")" -- part 5, and ends with a quote
\\ -- not part of a quoted string
"this text, with a \ -- part 1 of quoted string
\ backslash: \ -- part 2
\" -- part 3, and ends with a quote
With further analysis you can combine them, as appropriate:
If the group starts and ends with a ", then it is fine on its own
If the group starts with a ", and ends with a \, then it needs to be IMMEDIATELY followed by another match group that either ends with a quote character itself, or recursively continues to be IMMEDIATELY followed by another match group
If the group does not immediately follow another match, it is not part of a quoted string
I think that's all the analysis that you need - but make sure to test it!!!
Let me know if this idea helps!
EDIT:
Additional note: just to be clear, for this to work all quotes in the entire source string must be escaped if they are not to be used as delimiters, and backslashes must be escaped everywhere as well