Regex Match % but not \% - regex

I'm struggling to find the right regex for the case where I want to match a '%' but only if it's not preceded by a '\' between quotations.
For example I want this to come back as a match
Test \" this % matches \" test
But not match
Test \" this \% doesn't match \" test
Would a regex master be willing to assist me with this?!
Ultimately my goal is to ensure every '%' is escaped when found within quotations.
Edit:
Here's what I have right now, though it definitely isn't correct:
\".[%][^\%].\"

([^\\]|\\[^%])*
This appears to work in my tests on https://regex101.com/
The sections are ( [^\\] | \\[^%] )*
The ()* means 0 or more of the contained group.
The contained group is either [^\\] or \\[^%]. The first case is "any character that is not a backslash," which includes the percent sign. The second case is "a backslash followed by any character that is not a percent sign."
The [^ ] operator is "any character except these."
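As a rough check of the explanation above, here is a minimal Python sketch. Python and the helper name has_no_escaped_percent are my own choices for illustration, and applying the pattern to the whole string with fullmatch is an assumption about how it is meant to be used.

```python
import re

# Pattern from the answer: zero or more of
#   [^\\]    any character that is not a backslash, or
#   \\[^%]   a backslash followed by anything except '%'.
PATTERN = re.compile(r'([^\\]|\\[^%])*')


def has_no_escaped_percent(text):
    """Return True if the pattern can consume the whole text."""
    return PATTERN.fullmatch(text) is not None


if __name__ == "__main__":
    print(has_no_escaped_percent(r'Test \" this % matches \" test'))
    print(has_no_escaped_percent(r"Test \" this \% doesn't match \" test"))
```

Note that a string containing no % at all also passes, because the pattern only rules out the sequence \%; it does not require a % to be present.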

Related

Replace "advanced" pattern in sed

I can't figure out how to change this:
\usepackage{scrpage2}
\usepackage{pgf} \usepackage[latin1]{inputenc}\usepackage{times}\usepackage[T1]{fontenc}
\usepackage[colorlinks,citecolor=black,filecolor=black,linkcolor=black,urlcolor=black]{hyperref}
to this using sed only
REPLACED
REPLACED REPLACEDREPLACEDREPLACED
REPLACED
I'm trying things like sed 's!\\.*\([.*]\)\?{.\+}!REPLACED!g' FILE
but that gives me
REPLACED
REPLACED
REPLACED
I think .* gets used and everything else in my pattern is just ignored, but I can't figure out how to go about this.
After I learned how to format a regex like that, my next step would be to change it to this:
\usepackage{scrpage2}
\usepackage{pgf}
\usepackage[latin1]{inputenc}
\usepackage{times}
\usepackage[T1]{fontenc}
\usepackage[colorlinks,citecolor=black,filecolor=black,linkcolor=black,urlcolor=black]{hyperref}
So I would appreciate any pointers in that direction too.
Here's some code that happens to work for the example you gave:
sed 's/\\[^\\[:space:]]\+/REPLACED/g'
I.e. match a backslash followed by one or more characters that are not whitespace or another backslash.
To make things more specific, you can use
sed 's/\\[[:alnum:]]\+\(\[[^][]*\]\)\?{[^{}]*}/REPLACED/g'
I.e. match a backslash followed by one or more alphanumeric characters, followed by an optional [ ] group, followed by a { } group.
The [ ] group matches [, followed by zero or more non-bracket characters, followed by ].
The { } group matches {, followed by zero or more non-brace characters, followed by }.
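For readers who want to experiment outside sed, the second expression can be approximated in Python. This is only a sketch: [[:alnum:]] is narrowed to ASCII letters and digits, and the names MACRO and replace_macros are invented for the example.

```python
import re

# Approximate Python translation of the more specific sed expression.
MACRO = re.compile(
    r'\\[A-Za-z0-9]+'      # backslash plus the macro name
    r'(?:\[[^\[\]]*\])?'   # optional [ ] group without nested brackets
    r'\{[^{}]*\}'          # { } group without nested braces
)


def replace_macros(text, replacement="REPLACED"):
    """Replace every \\name[...]{...} or \\name{...} occurrence in text."""
    return MACRO.sub(replacement, text)


if __name__ == "__main__":
    line = (r'\usepackage{pgf} \usepackage[latin1]{inputenc}'
            r'\usepackage{times}\usepackage[T1]{fontenc}')
    print(replace_macros(line))
```

Because neither group is allowed to contain its own delimiter, each match stops at the first closing bracket or brace instead of running to the end of the line.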
Perl to the rescue! It features the "frugal quantifiers":
perl -pe 's!\\.*?\.?{.+?}!REPLACED!g' FILE
Note that I removed the capturing group as you didn't use it anywhere. Also, [.*] matches either a dot or an asterisk, but you probably wanted to match a literal dot instead.
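The difference between greedy and frugal (lazy) quantifiers can be seen in a small Python sketch. The greedy pattern below is a simplified stand-in for the one in the question, not a literal copy, and the variable names are illustrative.

```python
import re

LINE = (r'\usepackage{pgf} \usepackage[latin1]{inputenc}'
        r'\usepackage{times}\usepackage[T1]{fontenc}')

# Greedy: .* runs to the end of the line and backtracks only as far as
# needed, so a single match swallows every macro on the line.
greedy = re.sub(r'\\.*\{.+\}', 'REPLACED', LINE)

# Lazy: .*? and .+? stop at the first position where the rest of the
# pattern can match, so each macro is replaced separately.
lazy = re.sub(r'\\.*?\.?\{.+?\}', 'REPLACED', LINE)

if __name__ == "__main__":
    print(greedy)
    print(lazy)
```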

regex for first instance of a specific character that DOESN'T come immediately after another specific character

I have a function, translate(), that takes multiple parameters. Only the first parameter is required; it is a string that I always wrap in single quotes, like this:
translate('hello world');
The other params are optional, but could be included like this:
translate('hello world', true, 1, 'foobar', 'etc');
And the string itself could contain escaped single quotes, like this:
translate('hello\'s world');
To the point, I now want to search through all code files for all instances of this function call, and extract just the string. To do so I've come up with the following grep, which returns everything between translate(' and either ') or ',. Almost perfect:
grep -RoPh "(?<=translate\(').*?(?='\)|'\,)" .
The problem with this though, is that if the call is something like this:
translate('hello \'world\', you\'re great!');
My grep would only return this:
hello \'world\
So I'm looking to modify this so that the part that currently looks for ') or ', instead looks for the first occurrence of ' that hasn't been escaped, i.e. doesn't immediately follow a \
Hopefully I'm making sense. Any suggestions please?
You can use this grep with PCRE regex:
grep -RoPh "\btranslate\(\s*\K'(?:[^'\\\\]*)(?:\\\\.[^'\\\\]*)*'" .
Here is a regex demo
RegEx Breakup:
\b # word boundary
translate # match literal translate
\( # match a (
\s* # match 0 or more whitespace
\K # reset the matched information
' # match starting single quote
(?: # start non-capturing group
[^'\\\\]* # match 0 or more chars that are not a backslash or single quote
) # end non-capturing group
(?: # start non-capturing group
\\\\. # match a backslash followed by char that is "escaped"
[^'\\\\]* # match 0 or more chars that are not a backslash or single quote
)* # end non-capturing group
' # match ending single quote
Here is a version without \K using look-arounds:
grep -oPhR "(?<=\btranslate\(')(?:[^'\\\\]*)(?:\\\\.[^'\\\\]*)*(?=')" .
RegEx Demo 2
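The same unrolled pattern can be tried in Python. This is a sketch under a couple of assumptions: Python's re module has no \K, so a capturing group is used to leave out the quotes, and the names TRANSLATE_ARG and first_arguments are made up for the example.

```python
import re

TRANSLATE_ARG = re.compile(
    r"\btranslate\(\s*"    # the call and optional whitespace
    r"'"                   # opening single quote
    r"("                   # group 1: the string body
    r"[^'\\]*"             #   run of ordinary characters
    r"(?:\\.[^'\\]*)*"     #   (escaped char + ordinary run), repeated
    r")"
    r"'"                   # closing, unescaped single quote
)


def first_arguments(source):
    """Return the first string argument of every translate('...') call."""
    return TRANSLATE_ARG.findall(source)


if __name__ == "__main__":
    print(first_arguments(r"translate('hello \'world\', you\'re great!');"))
```

Since the pattern is written in a raw Python string, each backslash only needs to be doubled once, not four times as in the shell command.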
I think the problem is the .*? part: the ? makes it a non-greedy pattern, meaning it'll take the shortest string that matches the pattern. In effect, you're saying, "give me the shortest string that's followed by quote+close-paren or quote+comma". In your example, "world\" is followed by a single quote and a comma, so it matches your pattern.
In these cases, I like to use something like the following reasoning:
A string is a quote, zero or more characters, and a quote: '.*'
A character is anything that isn't a quote (because a quote terminates the string): '[^']*'
Except that you can put a quote in a string by escaping it with a backslash, so a character is either "backslash followed by a quote" or, failing that, "not a quote": '(\\'|[^'])*'
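Put it all together and you get the command below. The backslash is written four times because the shell's double quotes reduce \\\\ to \\, which grep -P then reads as one literal backslash:
grep -RoPh "(?<=translate\(')(\\\\'|[^'])*(?='\)|'\,)" .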
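To see the effect of this reasoning without shell quoting getting in the way, here is a small Python comparison of the lazy pattern from the question with the escape-aware alternation. The variable names are illustrative only.

```python
import re

CALL = r"translate('hello \'world\', you\'re great!');"

# Lazy version from the question: stops at the first quote that is
# followed by ")" or ",", even if that quote is escaped.
lazy = re.search(r"(?<=translate\(').*?(?='\)|',)", CALL)

# Escape-aware version: a character is either backslash + quote or any
# non-quote, so an escaped quote can never end the string.
escape_aware = re.search(r"(?<=translate\(')(?:\\'|[^'])*(?='\)|',)", CALL)

if __name__ == "__main__":
    print(lazy.group(0))
    print(escape_aware.group(0))
```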

Regex to match a string ignoring \"

I currently have this regex
"[^"]*"
I am testing it against this string (I am using http://regexpal.com/, so it has not been string-encoded yet!)
"This is a test \"Text File\"" "This is a test \"Text File\""
Currently it is matching
"This is a test \"
""
"This is a test \"
""
I would like it to have the following matches
"This is a test \"Text File\""
"This is a test \"Text File\""
Basically I want it to match something that starts with " and ends with " but ignore anything in the middle that is \". What do I need to add to my regex to achieve this?
Thanks in advance
The best way of doing this depends on the matching capabilities of your regex engine (support for the various features differs between engines). For a bare-bones regex engine that does not support any kind of look-behind, this is what you want: "([^"]*\\")*[^"]*"
This will match a quote, followed by zero or more pairs of non-quote sequences and \" sequences, followed by a required non-quote sequence, and finally a final quote.
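A quick way to check this pattern against the test string from the question is a short Python sketch. The group is made non-capturing here so that findall returns whole matches; that change and the names QUOTED and TEXT are mine.

```python
import re

# Quote, then zero or more (non-quote run + \") pairs, then a final
# non-quote run and the closing quote.
QUOTED = re.compile(r'"(?:[^"]*\\")*[^"]*"')

TEXT = r'"This is a test \"Text File\"" "This is a test \"Text File\""'

matches = QUOTED.findall(TEXT)

if __name__ == "__main__":
    for m in matches:
        print(m)
```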
(\\"|[^"])+
will match \" as well as any character that is not "
Regex for DART:
RegExp exp = new RegExp(r'(".*?"")');
http://regex101.com/r/hM5pI7
EXPLANATION:
Match the regular expression below and capture its match into backreference number 1 «(".*?"")»
Match the character “"” literally «"»
Match any single character that is not a line break character «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the characters “""” literally «""»

Regex for matching literal strings

I'm trying to write a regular expression which will match a string. For simplicity, I'm only concerned with double quote (") strings for the moment.
So far I have this: "\"[^\"]*\""
This works for most strings but fails when there is an escaped double quote such as this:
"a string \" with an escaped quote"
In this case, it only matches up to the escaped quote.
I've tried several things to allow an escaped quote but so far I've been unsuccessful, can anyone give me a hand?
I've managed to solve it myself:
"\"(\\.|[^\"\\])*\""
Try this:
"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*"
If you want a multi-line escaped string you can use:
"[^"\\]*(?:\\.[^"\\]*)*"
You need a negative lookbehind. Check if this works?
"\"[^\"]*(?<!\\)"
(?<!\\)" is supposed to match a " that's not preceded by \.
Try:
"((\\")|[^"(\\")])+"
From Regular Expression Library.
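To compare the patterns suggested so far in this thread, here is a small Python sketch. The constant names and the helper whole_match are invented for the example, and the patterns are written as plain regexes inside raw strings rather than in the escaped string-literal form used above.

```python
import re

# "Escaped anything or an ordinary character", repeated.
ALTERNATION = re.compile(r'"(\\.|[^"\\])*"')
# Unrolled form of the same idea.
UNROLLED = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')
# Regular Expression Library pattern: note that [^"(\\")] is a character
# class, so it also excludes literal parentheses.
LIBRARY = re.compile(r'"((\\")|[^"(\\")])+"')


def whole_match(pattern, text):
    """Return the text of the first match, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    for sample in (r'"a string \" with an escaped quote"', '"call (x)"'):
        for name, pat in (("alternation", ALTERNATION),
                          ("unrolled", UNROLLED),
                          ("library", LIBRARY)):
            print(name, whole_match(pat, sample))
```

All three handle the escaped quote, but the library pattern finds nothing in a string literal that contains parentheses, because the parentheses inside the brackets are treated as excluded characters rather than as grouping.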
Usually you want to accept escaped anything.
" [^"\\]* (?: \\. [^"\\]* )* " would be the fastest.
"[^"\\]*(?:\\.[^"\\]*)*" compressed.
POSIX does not, AFAIK, support lookaround - without it, there is really no way to do this with just regular expressions. However, according to a POSIX emulator I have (no access to a native environment or library), this might get you close in certain cases:
"[^\"]*"|"[^\]*\\|\\[^\"]*[\"]
It will capture the part before and the part after the escaped quote. With this source string (ignore the line breaks, and imagine it's all in one string):
I want to match "this text" and "This text, where there is an escaped
slash (\\), and an \"escaped quote\" (\")", but I also want to handle\\ escaped
back-slashes, as in "this text, with a \\ backslash: \\" -- with a little
text behind it!
it will capture these groups:
"this text" -- simple, quoted string
"This text, where there is an escaped slash (\ -- part 1 of quoted string
\), and an \ -- part 2
"escaped quote\ -- part 3
" (\ -- part 4
")" -- part 5, and ends with a quote
\\ -- not part of a quoted string
"this text, with a \ -- part 1 of quoted string
\ backslash: \ -- part 2
\" -- part 3, and ends with a quote
With further analysis you can combine them, as appropriate:
If the group starts and ends with a ", then it is fine on its own
If the group starts with a ", and ends with a \, then it needs to be IMMEDIATELY followed by another match group that either ends with a quote character itself, or recursively continues to be IMMEDIATELY followed by another match group
If the group does not immediately follow another match, it is not part of a quoted string
I think that's all the analysis that you need - but make sure to test it!!!
Let me know if this idea helps!
EDIT:
Additional note: just to be clear, for this to work all quotes in the entire source string must be escaped if they are not to be used as delimiters, and backslashes must be escaped everywhere as well

Why does this regular expression match?

I have this regex:
(?<!Sub ).*\(.*\)
And I'd like it to match this:
MsgBox ("The total run time to fix AREA and TD fields is: " & TimeElapsed & " minutes.")
But not this:
Sub ChangeAreaTD()
But somehow I still match the one that starts with Sub... does anyone have any idea why? I thought I'd be excluding "Sub " by doing
(?<!Sub )
Any help is appreciated!
Thanks.
Do this:
^MsgBox .*\(.*\)
The problem is that a negative lookbehind does not guarantee the beginning of a string. It will match anywhere.
However, adding a ^ character at the beginning of the regex does guarantee the beginning of the string. Then, change Sub to MsgBox so it only matches strings that begin with MsgBox
Your regex (?<!Sub ).*\(.*\), taken apart:
(?<! # negative look-behind
Sub # the string "Sub " must not occur before the current position
) # end negative look-behind
.* # anything ~ matches up to the end of the string!
\( # a literal "(" ~ causes the regex to backtrack to the last "("
.* # anything ~ matches up to the end of the string again!
\) # a literal ")" ~ causes the regex to backtrack to the last ")"
So, with your test string:
Sub ChangeAreaTD()
The look-behind is fulfilled immediately (right at position 0).
The .* travels to the end of the string after that.
Because of this .*, the look-behind never really makes a difference.
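The walkthrough above can be confirmed with a few lines of Python, whose re module supports fixed-width look-behind. The helper name first_match is made up for the example.

```python
import re


def first_match(pattern, text):
    """Return (start index, matched text) of the first match, or None."""
    m = re.search(pattern, text)
    return (m.start(), m.group(0)) if m else None


if __name__ == "__main__":
    # Nothing precedes position 0, so the negative look-behind succeeds
    # there and .* is free to consume "Sub " itself.
    print(first_match(r'(?<!Sub ).*\(.*\)', 'Sub ChangeAreaTD()'))
```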
You were probably thinking of
(?<!Sub .*)\(.*\)
but it is very unlikely that variable-length look-behind is supported by your regex engine.
So what I would do is this (since variable-length look-ahead is widely supported):
^(?!.*\bSub\b)[^(]+\(([^)]+)\)
which translates as:
^ # At the start of the string,
(?! # do a negative look-ahead:
.* # anything
\b # a word boundary
Sub # the string "Sub"
\b # another word boundary
) # end negative look-ahead. If not found,
[^(]+ # match anything except an opening paren ~ to prevent backtracking
\( # match a literal "("
( # match group 1
[^)]+ # match anything up to a closing paren ~ to prevent backtracking
) # end match group 1
\) # match a literal ")".
and then go for the contents of match group 1.
However, regex generally is hideously ill-suited for parsing code. This is true for HTML the same way it is true for VB code. You will get wrong matches even with the improved regex. For example here, because of the nested parens:
MsgBox ("The total run time to fix all fields (AREA, TD) is: ...")
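Here is a minimal Python sketch of the look-ahead version, including the nested-parentheses case that it gets wrong. The names CALL and call_argument and the shortened MsgBox line are illustrative, not taken from the original code.

```python
import re

CALL = re.compile(r'^(?!.*\bSub\b)[^(]+\(([^)]+)\)')


def call_argument(line):
    """Return match group 1 (the text between the parens), or None."""
    m = CALL.search(line)
    return m.group(1) if m else None


if __name__ == "__main__":
    print(call_argument('Sub ChangeAreaTD()'))
    print(call_argument('MsgBox ("done")'))
    # Nested parens: [^)]+ stops at the first ")", so the result is cut short.
    print(call_argument(
        'MsgBox ("The total run time to fix all fields (AREA, TD) is: ...")'))
```

The last call returns only the text up to "TD", which is the kind of wrong match the answer warns about.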
You have a backtracking problem here. The first .* in (?<!Sub ).*\(.*\) can match ChangeAreaTD or hangeAreaTD. In the latter case, the previous 4 characters are ub C, which does not match Sub. As the lookbehind is negated, this counts as a match!
Just adding a ^ to the beginning of your regex will not help you, as look-behind is a zero-length assertion. ^(?<!MsgBox ) would only check that the text immediately before the start of the line is not "MsgBox ", which is always true at the start of a string. What you need to do instead is ^(?!Sub )(.*\(.*\)). This can be interpreted as "Starting at the beginning of a string, make sure it does not start with Sub. Then, capture everything in the string if it looks like a method call".
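A short Python sketch of the anchored negative look-ahead, using a shortened MsgBox line as sample input. The names NOT_SUB and captured_call are invented for the example.

```python
import re

# At the start of the string, refuse lines that begin with "Sub ",
# then capture anything that looks like a call.
NOT_SUB = re.compile(r'^(?!Sub )(.*\(.*\))')


def captured_call(line):
    """Return the captured call text, or None if the line is rejected."""
    m = NOT_SUB.search(line)
    return m.group(1) if m else None


if __name__ == "__main__":
    print(captured_call('Sub ChangeAreaTD()'))
    print(captured_call('MsgBox ("hi")'))
```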
A good explanation of how regex engines parse lookaround can be found here.
If you want to match just the function call, not the declaration, then the part before the bracket should not match arbitrary characters, but rather identifier characters followed by optional spaces. Thus
(?<!Sub )[a-zA-Z][a-zA-Z0-9_]* *\(.*\)
The identifier may need more tokens depending on the language you're matching.
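Trying this pattern in Python suggests it is still affected by the backtracking issue described earlier: the match can simply start one character into the identifier. The sketch below shows that, together with a hypothetical tweak that is not part of the original answer (a \b before the identifier). The names AS_POSTED, WITH_BOUNDARY and matched_text are made up.

```python
import re

AS_POSTED = re.compile(r'(?<!Sub )[a-zA-Z][a-zA-Z0-9_]* *\(.*\)')

# Hypothetical tweak (an assumption, not from the answer): \b forces the
# match to begin at the start of an identifier, so it cannot skip a letter
# to get past the look-behind.
WITH_BOUNDARY = re.compile(r'(?<!Sub )\b[a-zA-Z][a-zA-Z0-9_]* *\(.*\)')


def matched_text(pattern, line):
    """Return the text of the first match, or None if there is none."""
    m = pattern.search(line)
    return m.group(0) if m else None


if __name__ == "__main__":
    print(matched_text(AS_POSTED, 'Sub ChangeAreaTD()'))
    print(matched_text(WITH_BOUNDARY, 'Sub ChangeAreaTD()'))
    print(matched_text(WITH_BOUNDARY, 'MsgBox ("done")'))
```

The tweak still relies on exactly one space after Sub, so treat it as a starting point rather than a robust parser.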