Capture text in quotes immediately before keyword - regex

I have an input stream that looks like this:
"ignore this" blah "ignore this" blah "capture this" keyword "ignore this" blah
I want to capture the text capture this, i.e. the quoted text immediately before keyword.
I tried the regex (?:\"(.*)\" )(?=keyword), but this captures everything up to the quotation mark before keyword. How would I capture the text in quotes directly before keyword?

The pattern (?:\"(.*)\" )(?=keyword) matches the first " and then matches the last occurrence where a double quote followed by a space is followed by keyword because the dot also matches a double quote.
Note that in the pattern the non capturing group (?: can be omitted and the " does not have to be escaped.
You could use a negated character class instead to match any character except a ". The value is then in the first capturing group:
"([^"]+)"(?= keyword)
Explanation
" Match literally
( Capturing group
[^"]+ Match 1+ times any char except "
) Close group
"(?= keyword) Match " and assert what is directly to the right is a space and keyword
An example using JavaScript:
const regex = /"([^"]+)"(?= keyword)/g;
const str = `"ignore this" blah "ignore this" blah "capture this" keyword "ignore this" blah`;
let m;
while ((m = regex.exec(str)) !== null) {
  // Guard against infinite loops on zero-width matches
  if (m.index === regex.lastIndex) {
    regex.lastIndex++;
  }
  console.log(m[1]); // capture this
}
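For comparison, here is a minimal Python sketch (using Python's re module rather than the JavaScript above; the variable names are only illustrative) that runs the greedy pattern from the question next to the negated character class, to show where each one stops.

```python
import re

text = '"ignore this" blah "ignore this" blah "capture this" keyword "ignore this" blah'

# Greedy dot from the question: starts at the first quote and runs to the
# last quote that is followed by " keyword", crossing other quotes on the way.
greedy = re.search(r'"(.*)" (?=keyword)', text)

# Negated character class: the captured text can never contain a double quote.
negated = re.findall(r'"([^"]+)"(?= keyword)', text)

print(greedy.group(1))
print(negated)
```

The greedy version returns everything from the first quoted string onward, while the negated class returns only the quoted text directly before keyword.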

Try using lookaround assertions
var input = `"ignore this" blah "ignore this" blah "capture this" keyword "ignore this" blah`;
var result = /(?<=\")[A-Za-z0-9\ ]*(?=\" keyword)/i.exec(input)
console.log(result);
Here (?<=\") looks for content that follows " and (?=\" keyword) looks for content that is followed by " keyword.
More about Lookahead and Lookbehind Zero-Length Assertions here:
https://www.regular-expressions.info/lookaround.html
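A rough Python equivalent of the lookaround approach, as a sketch only: because both quotes sit inside assertions, the wanted text is the whole match (group 0) and no capturing group is needed. One caveat that follows from the character class [A-Za-z0-9 ]: quoted text containing punctuation will not match. The second sample string is made up for illustration.

```python
import re

text = '"ignore this" blah "ignore this" blah "capture this" keyword "ignore this" blah'

# Lookbehind requires a quote before the match, lookahead requires
# '" keyword' after it; neither is part of the returned text.
pattern = re.compile(r'(?<=")[A-Za-z0-9 ]*(?=" keyword)')

match = pattern.search(text)
print(match.group(0))

# Hypothetical input with an apostrophe: the character class rejects it.
no_match = pattern.search('"don\'t stop" keyword')
print(no_match)
```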

The string you want to capture sits between double quotes and is followed by a specific keyword. Simply match the pattern: a ", followed by anything that is not a ", followed by " and then keyword.
var input = `"ignore this" blah "ignore this" blah "capture this" keyword "ignore this" blah`;
var result = /(?=\")?[^"]+(?=\"\s*keyword)/i.exec(input)
console.log(result);

Related

What is the RegExp that replace all the "." occurrences inside "<code>" tag

I would like to replace all the dots inside the HTML tag <code> with the word " dot ".
If I do like this it will change only the first occurrence.
I would like to change them all.
const str = 'Some text <code class="foo">this.is.a.class</code> and <code>this.another.thing</code>';
const res = str.replaceAll(/(<code[^>]*>.*?)(\.)(.*?<\/code>)/g, "$1 dot $3");
console.log(res);
// Some text <code class="foo">this dot is.a.class</code> and <code>this dot another.thing</code>
Why is it changing only the first?
You can use a String#replace method with a callback function:
const str = 'Some text <code class="foo">this.is.a.class</code> and <code>this.another.thing</code>';
const res = str.replaceAll(/(<code[^>]*>)([\s\S]*?<\/code>)/g, (_,x,y) => `${x}${y.replaceAll(".", " dot ")}`);
console.log(res);
The (<code[^>]*>)([\s\S]*?<\/code>) regex matches and captures the opening code tag into Group 1 (the x variable), and then captures any zero or more chars, as few as possible, together with the closing code tag into Group 2 (the y variable). When replacing, all dots in y (the Group 2 value) are replaced with " dot " inside the arrow function.
Note that [\s\S] matches any chars including line break chars, . does not match line break chars (at least by default).
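The same idea can be sketched in Python, where re.sub accepts a function as the replacement. This is only an illustration of the callback technique; the helper name dots_to_words is made up here.

```python
import re

html = 'Some text <code class="foo">this.is.a.class</code> and <code>this.another.thing</code>'

def dots_to_words(m):
    # Group 1 is the opening tag, group 2 is the content plus the closing tag.
    # Only group 2 has its dots replaced.
    return m.group(1) + m.group(2).replace('.', ' dot ')

result = re.sub(r'(<code[^>]*>)([\s\S]*?</code>)', dots_to_words, html)
print(result)
```

The regex only locates each code element once; the per-dot replacement happens inside the callback, which is why every dot is handled rather than just the first.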

Character not at beginning of line; not followed or preceded by character

I'm trying to isolate a " character when (simultaneously):
it's not in the beginning of the line
it's not followed by the character ";"
it's not preceded by the character ";"
E.g.:
Line: "Best Before - NO MATCH
Line: Best Before"; - NO MATCH
Line: ;"Best "Before - NO MATCH
Line: Best "Before - MATCH
My best solution is (?<![;])([^^])(")(?![;]) but it's not working correctly.
I also tried (?<![;])(")(?![;]), but it's only partial (missing the "not at the beginning" part)
I don't understand why I'm spelling the "AND not at the beginning" wrong.
Where am I missing it?
If you want to allow partial matches, you can extend the negative lookbehind with an alternation that also rejects the start of the string to the left.
The semicolon does not have to be between square brackets.
(?<!;|^)"(?!;)
If you want to match the " only when there is no ;" to its left and no ;" to its right (and the " itself is not followed by ;), and the regex flavor allows an unbounded quantifier in a lookbehind assertion:
(?<!^.*;(?=").*|^)"(?!;|.*;")
In Notepad++ you can use
^.*(?:;"|";).*$(*SKIP)(*F)|(?<!^)"
You can use the fact that not preceded by ; means that it's also not the first character on the line to simplify things
[^;]"(?:[^;]|$)
This gives you
Match a character that's not a ; (so there must be a character and thus the next character can't be the start of the line)
Match a "
Match a character that's not a ; or the end of the line
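To compare this consuming pattern with the lookaround pattern (?<!;|^)"(?!;) from the earlier answer, here is a small Python sketch. One assumption to flag: Python's re module requires fixed-width lookbehinds, so the alternation is written as two separate lookbehinds, (?<!;)(?<!^). Both patterns allow partial matches, so the third sample line reports a match on its second quote.

```python
import re

lines = ['"Best Before', 'Best Before";', ';"Best "Before', 'Best "Before']

# Lookaround version: matches only the quote character itself.
lookaround = re.compile(r'(?<!;)(?<!^)"(?!;)')

# Consuming version: the neighbours of the quote are part of the match.
consuming = re.compile(r'[^;]"(?:[^;]|$)')

for line in lines:
    a = lookaround.search(line)
    b = consuming.search(line)
    print(repr(line), a.group() if a else None, b.group() if b else None)
```

The practical difference is in what gets matched: the lookaround version isolates the " alone, while the consuming version also includes the characters on either side, which matters if the match is used for replacement.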
I know you are asking for a regex solution, but strings can almost always also be filtered using the string methods of whatever language you are working in.
For the sake of completeness, to show that regex is not your only available tool here, here is a short JavaScript example using the string methods:
myString.charAt()
myString.includes()
Working Example:
const checkLine = (line) => {
  switch (true) {
    // DOUBLE QUOTES AT THE BEGINNING
    case (line.charAt(0) === '"'):
      return console.log(line, '// NO MATCH');
    // DOUBLE QUOTES IMMEDIATELY FOLLOWED BY SEMI-COLON
    case (line.includes('";')):
      return console.log(line, '// NO MATCH');
    // DOUBLE QUOTES IMMEDIATELY PRECEDED BY SEMI-COLON
    case (line.includes(';"')):
      return console.log(line, '// NO MATCH');
    default:
      return console.log(line, '// MATCH');
  }
}
checkLine('"Best Before');
checkLine('Best Before";');
checkLine(';"Best "Before');
checkLine('Best "Before');
Further Reading:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/charAt
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/includes

regex - how to specify the expressions to exclude

I need to replace two characters {, } with {\n, \n}.
But they must be not surrounded in '' or "".
I tried this code to achieve that
text = 'hello(){imagine{myString("HELLO, {WORLD}!")}}'
replaced = re.sub(r'{', "{\n", text)
Ellipsis...
Naturally, this code also replaces curly brackets that are surrounded by quote marks.
What are the negative statements like ! or not that can be used in regular expressions?
And the following is what I wanted.
hello(){
imagine{
puts("{HELLO}")
}
}
In a nutshell - what I want to do is
Search { and }.
If that is not enclosed in '' or ""
replace { or } to {\n or \n}
In the opposite case, I can solve it with (?P<a>\".*){(?P<b>.*?\").
But I have no clue how I can solve it in my case.
First replace all { characters with {\n. This also turns "{ into "{\n, so afterwards replace every "{\n back with "{. Note that this only protects a { that directly follows a quote.
text = 'hello(){imagine{puts("{HELLO}")}}'
replaced = text.replace('{', '{\n').replace('"{\n', '"{')
You may match single and double quoted (C-style) string literals (those that support escape entities with backslashes) and then match { and } in any other context that you may replace with your desired values.
See Python demo:
import re
text = 'hello(){imagine{puts("{HELLO}")}}'
dblq = r'(?<!\\)(?:\\{2})*"[^"\\]*(?:\\.[^"\\]*)*"'
snlq = r"(?<!\\)(?:\\{2})*'[^'\\]*(?:\\.[^'\\]*)*'"
rx = re.compile(r'({}|{})|[{{}}]'.format(dblq, snlq))
print(rx.pattern)
def repl(m):
    if m.group(1):
        return m.group(1)
    elif m.group() == '{':
        return '{\n'
    else:
        return '\n}'
# Examples
print(rx.sub(repl, text))
print(rx.sub(repl, r'hello(){imagine{puts("Nice, Mr. \"Know-all\"")}}'))
print(rx.sub(repl, "hello(){imagine{puts('MORE {HELLO} HERE ')}}"))
The pattern that is generated in the code above is
((?<!\\)(?:\\{2})*"[^"\\]*(?:\\.[^"\\]*)*"|(?<!\\)(?:\\{2})*'[^'\\]*(?:\\.[^'\\]*)*')|[{}]
It can actually be reduced to
(?<!\\)((?:\\{2})*(?:"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'))|[{}]
See the regex demo.
Details:
The pattern matches 2 main alternatives. The first one matches single- and double-quoted string literals.
(?<!\\) - no \ immediately to the left is allowed
((?:\\{2})*(?:"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')) - Group 1:
(?:\\{2})* - 0+ repetitions of two consecutive backslashes
(?: - a non-capturing group:
"[^"\\]*(?:\\.[^"\\]*)*" - a double quoted string literal
| - or
'[^'\\]*(?:\\.[^'\\]*)*' - a single quoted string literal
) - end of the non-capturing group
| - or
[{}] - a { or }.
In the repl method, Group 1 is checked first. If it participated in the match, a single- or double-quoted string literal was matched and is put back unchanged. Otherwise, if the match value is {, it is replaced with {\n; else it is a } and is replaced with \n}.
Replace { with {\n:
text.replace('{', '{\n')
Replace } with \n}:
text.replace('}', '\n}')
Now to fix the braces that were quoted:
text.replace('"{\n','"{')
and
text.replace('\n}"', '}"')
Combined together:
replaced = text.replace('{', '{\n').replace('}', '\n}').replace('"{\n','"{').replace('\n}"', '}"')
Output
hello(){
imagine{
puts("{HELLO}")
}
}
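One limitation worth checking: the chained replace only repairs braces that touch a quote. The sketch below runs it on the string from the question, where the braces sit in the middle of the string literal, and compares it with a simplified form of the regex approach from the earlier answer. The simplification is an assumption made for brevity: the prefix that handles backslashes before the opening quote is dropped, and the function names are made up.

```python
import re

def chained_replace(text):
    # Replace every brace, then undo only the ones adjacent to a double quote.
    return (text.replace('{', '{\n').replace('}', '\n}')
                .replace('"{\n', '"{').replace('\n}"', '}"'))

DBLQ = r'"[^"\\]*(?:\\.[^"\\]*)*"'
SNLQ = r"'[^'\\]*(?:\\.[^'\\]*)*'"
STRING_OR_BRACE = re.compile('(' + DBLQ + '|' + SNLQ + ')|[{}]')

def regex_replace(text):
    def repl(m):
        if m.group(1):          # a quoted string literal: keep it as is
            return m.group(1)
        return '{\n' if m.group() == '{' else '\n}'
    return STRING_OR_BRACE.sub(repl, text)

text = 'hello(){imagine{myString("HELLO, {WORLD}!")}}'
print(chained_replace(text))
print(regex_replace(text))
```

With this input the chained version splits {WORLD} across lines inside the string literal, while the regex version leaves the literal untouched.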
You can check the similarities with the input and try to match them.
text = 'hello(){imagine{puts("{HELLO}")}}'
replaced = text.replace('){', '){\n').replace('{puts', '{\nputs').replace('}}', '\n}\n}')
print(replaced)
output:
hello(){
imagine{
puts("{HELLO}")
}
}
UPDATE
try this: https://regex101.com/r/DBgkrb/1

Regular expression for match string with new line char

How can I use a regular expression to match the passphrase in the text between the Passphrase= string and the \n characters (i.e. select testpasssword)? The password can contain any characters.
My partial solution: Passphrase.*(?=\\nName) => Passphrase=testpasssword
[wifi_d0b5c2bc1d37_7078706c617967726f756e64_managed_psk]\nPassphrase=testpasssword\nName=pxplayground\nSSID=9079706c697967726f759e69\nFrequency=2462\nFavorite=true\nAutoConnect=true\nModified=2018-06-18T09:06:26.425176Z\nIPv4.method=dhcp\nIPv4.DHCP.LastAddress=0.0.0.0\nIPv6.method=auto\nIPv6.privacy=disabled\n
With QRegularExpression that supports PCRE regex syntax, you may use
QString str = "your_string";
QRegularExpression rx(R"(Passphrase=\K.+?(?=\\n))");
qDebug() << rx.match(str).captured(0);
See the regex demo
The R"(Passphrase=\K.+?(?=\\n))" is a raw string literal defining the regex pattern Passphrase=\K.+?(?=\\n). It matches Passphrase=, drops the text matched so far with the match reset operator \K, and then matches 1 or more chars, as few as possible, up to the first \ char followed by the letter n.
You may use a capturing group approach that looks simpler though:
QRegularExpression rx(R"(Passphrase=(.+?)\\n)");
qDebug() << rx.match(str).captured(1); // Here, grab Group 1 value!
See this regex demo.
The only things you were missing are the lazy quantifier, which tells the regex to match only as much as necessary, and a positive lookbehind. The first is a question mark after the plus; the second is written by prefixing the text you want to match but not include with ?<= inside a group. Check the code example to see it in action.
(?<=Passphrase=).+?(?=\\n)
const regex = /(?<=Passphrase=).+?(?=\\n)/gm;
const str = `[wifi_d0b5c2bc1d37_7078706c617967726f756e64_managed_psk]\\nPassphrase=testpasssword\\nName=pxplayground\\nSSID=9079706c697967726f759e69\\nFrequency=2462\\nFavorite=true\\nAutoConnect=true\\nModified=2018-06-18T09:06:26.425176Z\\nIPv4.method=dhcp\\nIPv4.DHCP.LastAddress=0.0.0.0\\nIPv6.method=auto\\nIPv6.privacy=disabled\\n
`;
let m;
while ((m = regex.exec(str)) !== null) {
  // This is necessary to avoid infinite loops with zero-width matches
  if (m.index === regex.lastIndex) {
    regex.lastIndex++;
  }
  // The result can be accessed through the `m`-variable.
  m.forEach((match, groupIndex) => {
    console.log(`Found match, group ${groupIndex}: ${match}`);
  });
}
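The effect of the lazy quantifier can be seen in a short Python sketch. It assumes, as in the question, that \n in the input is a literal backslash followed by n (hence the raw string), and it uses a shortened copy of the sample text. Python's re module does not support \K, so only the lookbehind and capturing-group variants are shown.

```python
import re

# Raw string: each \n below is a literal backslash plus the letter n.
config = r'[wifi_d0b5c2bc1d37_7078706c617967726f756e64_managed_psk]\nPassphrase=testpasssword\nName=pxplayground\n'

# Lazy: stops at the first literal \n after Passphrase=.
lazy = re.search(r'(?<=Passphrase=).+?(?=\\n)', config)

# Greedy: runs on to the last literal \n in the text.
greedy = re.search(r'(?<=Passphrase=).+(?=\\n)', config)

# Capturing-group variant: the passphrase is in group 1.
grouped = re.search(r'Passphrase=(.+?)\\n', config)

print(lazy.group(0))
print(greedy.group(0))
print(grouped.group(1))
```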

Why won't Groovy honor "OR" instances in my regex?

It is well established that "|" in a regex is the "OR" operator. So when I run this:
static void main(String[] args) {
    String permission = "[fizz]:[index]"
    if((permission =~ /\[fizz|buzz]:\[.*]/).matches()) {
        println "We match!"
    } else {
        println "We don't match!"
    }
}
...then why does it print "We don't match!"???
The regex \[fizz|buzz]:\[.*] matches:
\[fizz - literal [ followed by fizz
| - OR operator....
buzz]:\[ - matches literal buzz]:[
.* - any character but a newline, as many times as possible, greedy
] - a literal ].
I think you need to re-group the alternatives:
if((permission =~ /\[(?:fizz|buzz)]:\[[^\]]*]/).matches()) {
Here, \[(?:fizz|buzz)]:\[[^\]]*] will match a [, then either fizz or buzz (without capturing the word), then ]:[, then [^\]]* will match zero or more characters other than ], and finally a ].
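The scope of the alternation can be checked with a small Python sketch. It uses re.fullmatch because Groovy's matches() also requires the whole string to match; the extra test strings are made up for illustration.

```python
import re

permission = '[fizz]:[index]'

ungrouped = r'\[fizz|buzz]:\[.*]'          # means: "\[fizz" OR "buzz]:\[.*]"
grouped = r'\[(?:fizz|buzz)]:\[[^\]]*]'    # alternation limited to the two words

# The ungrouped pattern fails on the real input...
print(re.fullmatch(ungrouped, permission))

# ...but fully matches each of its two alternatives on their own.
print(bool(re.fullmatch(ungrouped, '[fizz')))
print(bool(re.fullmatch(ungrouped, 'buzz]:[x]')))

# The grouped pattern matches the intended input.
print(bool(re.fullmatch(grouped, permission)))
```

Without the group, | splits the entire pattern in two, so neither half can cover the full permission string.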