Get all text between two parentheses but not a specific word - regex

I want to use a regex to highlight all text between two delimiters (<< and >>), except for one specific word.
Example:
"This is a << long text used just as an example to test >> how I can use Regex"
Some words sit between << and >>. I need two different styles: one for any text between << >>, and another for a single word inside it (shown in bold in the original example). For example:
"long text ... to test" will be in RED
Only "simple" will be in GREEN
Any other words outside the delimiters will be in black
I'm using the flutter_parsed_text package:
ParsedText(
  text: widget.text, // the main text to be parsed with regex
  style: TextStyle(color: Colors.black), // default style if nothing matches
  parse: [
    // 1st pattern: match anything between << and >>
    MatchText(
      pattern: r"<<(.*?)>>", // "<< everything, but not $searchKeyword >>"
      style: TextStyle(color: ColorsUtils.red),
    ),
    // 2nd pattern: match the searchKeyword, which is not selected by any other pattern
    MatchText(
      pattern: '$searchKeyword',
      style: TextStyle(color: Colors.green),
    )
  ],
);
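One possible way to reason about the problem, independent of flutter_parsed_text, is to split the text in two passes: first find each << ... >> block, then split the inside of that block around the keyword. The sketch below is in JavaScript and only illustrates the idea. The `segment` and `escapeRegExp` helpers and the style names are hypothetical, the keyword is assumed to be non-empty, the delimiters themselves are dropped from the output, and the keyword is only highlighted when it appears inside a block.

```javascript
// Hypothetical helper: escape regex metacharacters in a literal keyword.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: returns a list of { text, style } segments.
// "default" = outside << >>, "red" = inside << >>, "green" = keyword inside << >>.
function segment(text, keyword) {
  const segments = [];
  const block = /<<(.*?)>>/gs;
  let last = 0;
  for (const m of text.matchAll(block)) {
    if (m.index > last) {
      segments.push({ text: text.slice(last, m.index), style: "default" });
    }
    // Splitting on a capturing group keeps the keyword at the odd indices.
    const parts = m[1].split(new RegExp(`(${escapeRegExp(keyword)})`));
    parts.forEach((part, i) => {
      if (part !== "") {
        segments.push({ text: part, style: i % 2 === 1 ? "green" : "red" });
      }
    });
    last = m.index + m[0].length;
  }
  if (last < text.length) {
    segments.push({ text: text.slice(last), style: "default" });
  }
  return segments;
}

console.log(segment("a << x simple y >> b", "simple"));
```

Each segment could then be mapped to a styled text span in whatever UI toolkit is in use.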

Related

Capture text in quotes immediately before keyword

I have an input stream that looks like this:
"ignore this" blah "ignore this" blah "capture this" keyword "ignore this" blah
I want to capture "capture this", i.e. the text in quotes before keyword.
I tried the regex (?:\"(.*)\" )(?=keyword), but this captures everything up to the quotation mark before keyword. How would I capture the text in quotes directly before keyword?
The pattern (?:\"(.*)\" )(?=keyword) matches the first " and then matches the last occurrence where a double quote followed by a space is followed by keyword because the dot also matches a double quote.
Note that in the pattern the non capturing group (?: can be omitted and the " does not have to be escaped.
You could use a negated character class instead to match any character except a "
The value is in the first capturing group.
"([^"]+)"(?= keyword)
Explanation
" Match literally
( Capturing group
[^"]+ Match 1+ times any char except "
) Close group
"(?= keyword) Match " and assert what is directly to the right is a space and keyword
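To make the difference concrete, here is a small comparison of the greedy pattern and the negated-character-class pattern on the sample input. It is only an illustrative sketch; the variable names are made up for this example.

```javascript
const input =
  '"ignore this" blah "ignore this" blah "capture this" keyword "ignore this" blah';

// Greedy: .* starts at the first quote and backtracks only as far as the
// last '" ' that is followed by "keyword", so it swallows the earlier quotes.
const greedy = /"(.*)" (?=keyword)/.exec(input);
console.log(greedy[1]);
// ignore this" blah "ignore this" blah "capture this

// Negated class: [^"]+ cannot cross a quote, so only the quoted text
// directly before " keyword" can match.
const scoped = /"([^"]+)"(?= keyword)/.exec(input);
console.log(scoped[1]);
// capture this
```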
An example using JavaScript:
const regex = /"([^"]+)"(?= keyword)/g;
const str = `"ignore this" blah "ignore this" blah "capture this" keyword "ignore this" blah`;
let m; // holds the current match
while ((m = regex.exec(str)) !== null) {
  // Avoid an infinite loop on zero-length matches
  if (m.index === regex.lastIndex) {
    regex.lastIndex++;
  }
  console.log(m[1]); // the text inside the quotes
}
Try using lookaround assertions
var input = `"ignore this" blah "ignore this" blah "capture this" keyword "ignore this" blah`;
var result = /(?<=\")[A-Za-z0-9\ ]*(?=\" keyword)/i.exec(input)
console.log(result);
Here (?<=\") looks for content that follows " and (?=\" keyword) looks for content that is followed by " keyword.
More about Lookahead and Lookbehind Zero-Length Assertions here:
https://www.regular-expressions.info/lookaround.html
The string you want is between double quotes and is followed by a specific keyword. Simply match a run of characters that are not ", followed by " and then keyword.
var input = `"ignore this" blah "ignore this" blah "capture this" keyword "ignore this" blah`;
var result = /[^"]+(?=\"\s*keyword)/i.exec(input)
console.log(result);

QRegExp to extract array name and index

I am parsing some strings. If I encounter something like "Foo(bar)", I want to extract "Foo" and "bar"
How do I do it using QRegExp?
First, if you are using Qt 5, use the QRegularExpression class instead:
The QRegularExpression class introduced in Qt 5 is a big improvement upon QRegExp, in terms of APIs offered, supported pattern syntax and speed of execution.
Second, get a visual tool that helps with testing and defining regular expressions. I use an online website.
To get the "Foo" and "bar" from your example, I can suggest the following pattern:
(\w+)\((\w+)\)
--------------
The above means:
(\w+) - capture one or more word characters (capture group 1)
\( - followed by an opening parenthesis
(\w+) - then capture one or more word characters (capture group 2)
\) - followed by a closing parenthesis
The backslashes must be escaped when the pattern is written as a C++ string literal:
#include <QDebug>
#include <QRegularExpression>

int main() {
    const QRegularExpression expression( "(\\w+)\\((\\w+)\\)" );
    QRegularExpressionMatch match = expression.match( "Foo(bar)" );
    if( match.hasMatch() ) {
        qDebug() << "0: " << match.captured( 0 ); // 0 is the complete match
        qDebug() << "1: " << match.captured( 1 ); // First capture group
        qDebug() << "2: " << match.captured( 2 ); // Second capture group
    }
    return 0;
}
Output is:
0: "Foo(bar)"
1: "Foo"
2: "bar"

PhpStorm search and replace multiple times between two strings

In PhpStorm IDE, using the search and replace feature, I'm trying to add .jpg to all strings between quotes that come after $colorsfiles = [ and before the closing ].
$colorsfiles = ["Blue", "Red", "Orange", "Black", "White", "Golden", "Green", "Purple", "Yellow", "cyan", "Gray", "Pink", "Brown", "Sky Blue", "Silver"];
If a quoted string such as "abc" is not between $colorsfiles = [ and ], it should not be replaced.
The regex that I'm using is
$colorsfiles = \[("(\w*?)", )*
and replace string is
$colorsfiles = ["$2.jpg"]
The current result is
$colorsfiles = ["Brown.jpg"]"Sky Blue", "Silver"];
While the expected output is
$colorsfiles = ["Blue.jpg", "Red.jpg", "Orange.jpg", "Black.jpg", "White.jpg", "Golden.jpg", "Green.jpg", "Purple.jpg", "Yellow.jpg", "cyan.jpg", "Gray.jpg", "Pink.jpg", "Brown.jpg", "Sky Blue.jpg", "Silver.jpg"];
You should have mentioned that you're doing this in an IDE.
Even though I don't use PhpStorm, here is a solution I tested in NetBeans.
Find : "([\w ]+)"([\,\]]{1})
Replace : "$1\.jpg"$2
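As a rough check of what this find/replace pair does, the same pattern can be tried in JavaScript. This is only a sketch: the IDE uses its own regex engine, the sample array is shortened, and the `$label` line is a made-up example of a string outside `$colorsfiles`. It shows that the pattern is not scoped to the `$colorsfiles` array, so any quoted string followed by `,` or `]` is changed.

```javascript
const line = '$colorsfiles = ["Blue", "Red", "Sky Blue", "Silver"];';
const other = '$label = ["abc"];'; // hypothetical unrelated line

// Same pattern as above; {1} is redundant and omitted here.
const find = /"([\w ]+)"([,\]])/g;

const replacedLine = line.replace(find, '"$1.jpg"$2');
const replacedOther = other.replace(find, '"$1.jpg"$2');

console.log(replacedLine);
// $colorsfiles = ["Blue.jpg", "Red.jpg", "Sky Blue.jpg", "Silver.jpg"];
console.log(replacedOther);
// $label = ["abc.jpg"];
```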
Why do you need a regex for this? A simple array_map() will do the trick.
<?php
// Append the .jpg extension to a single color name
function addExtension($color)
{
    return $color.".jpg";
}
$colorsfiles = ["Blue", "Red", "Orange", "Black", "White", "Golden", "Green", "Purple", "Yellow", "cyan", "Gray", "Pink", "Brown", "Sky Blue", "Silver"];
$colorsfiles_with_extension = array_map("addExtension", $colorsfiles);
print_r($colorsfiles_with_extension);
?>
Edit: I've tested it in PhpStorm. Use the following:
search:
"([a-zA-Z\s]+)"
replace_all:
"$1.jpg"
You may use
(\G(?!^)",\s*"|\$colorsfiles\s*=\s*\[")([^"]+)
and replace with $1$2.jpg.
The regex matches either $colorsfiles = [" or the end of the previous match followed by ", " and captures that text into Group 1 (later referred to with the $1 placeholder). It then captures into Group 2 (later referred to with $2) one or more chars other than a double quotation mark.
Details
(\G(?!^)",\s*"|\$colorsfiles\s*=\s*\[") - Capturing group 1, one of:
  \G(?!^)",\s*" - the end of the previous match (\G(?!^)), then a double quote and a comma (",), 0+ whitespaces (\s*) and a " char
  | - or
  \$colorsfiles\s*=\s*\[" - $colorsfiles, 0+ whitespaces (\s*), =, 0+ whitespaces, [" (note that $ and [ must be escaped to match literal chars)
([^"]+) - Capturing group 2: one or more (+) chars other than " (the negated character class, [^"])
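JavaScript regexes do not support \G, so the pattern above cannot be run there as written. The sketch below swaps in a different technique with the same goal: a two-step replace that first finds the `$colorsfiles = [...]` literal and then edits only the quoted strings inside it. The `addJpg` helper is hypothetical, and it assumes that no quoted string contains a `]` character.

```javascript
// Hypothetical helper: append .jpg only to strings inside $colorsfiles = [ ... ].
function addJpg(source) {
  // Step 1: locate the array literal and split it into open / body / close.
  return source.replace(
    /(\$colorsfiles\s*=\s*\[)([^\]]*)(\])/g,
    (_, open, body, close) =>
      // Step 2: rewrite each quoted string within the brackets only.
      open + body.replace(/"([^"]+)"/g, '"$1.jpg"') + close
  );
}

const sample = '$colorsfiles = ["Blue", "Sky Blue"];\n$x = ["abc"];';
console.log(addJpg(sample));
// $colorsfiles = ["Blue.jpg", "Sky Blue.jpg"];
// $x = ["abc"];
```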

validator.addMethod for checking before and end whitespaces

I want to validate that a field has no white space before or after the text string. Spaces in the middle of the string are allowed.
Here is my code:
$.validator.addMethod("trimLookup", function(value, element) {
regex = "^[^\s]+(\s+[^\s]+)*$";
regex = new RegExp( regex );
return this.optional( element ) || regex.test( value );
}, $.validator.format("Cannot contains any spaces at beginning or end"));
I tested the regex on https://regex101.com/ and it works fine. I also tested this code with other regexes and it works. But if I enter " " or " abc ", it doesn't work.
Any suggestions?
Thank you for your time!
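One likely cause, assuming the code runs exactly as posted: the pattern is written as a JavaScript string literal, where `\s` is not a recognized escape and collapses to a plain `s`. The RegExp that is actually built therefore excludes the letter "s" rather than white space. The sketch below shows this and a possible fix using a regex literal. `isTrimmed` is a hypothetical helper name, and the jQuery Validation wiring is left out.

```javascript
// What the string literal actually produces: the backslashes are gone.
const fromString = new RegExp("^[^\s]+(\s+[^\s]+)*$");
console.log(fromString.source);      // ^[^s]+(s+[^s]+)*$
console.log(fromString.test(" abc ")); // true (leading/trailing spaces slip through)

// Possible fix: use a regex literal (or double the backslashes in the string).
const noEdgeSpaces = /^[^\s]+(\s+[^\s]+)*$/;

// Hypothetical helper that mirrors the check inside the validator method.
function isTrimmed(value) {
  return noEdgeSpaces.test(value);
}

console.log(isTrimmed("a b c")); // true
console.log(isTrimmed(" abc ")); // false
console.log(isTrimmed(" "));     // false
```

Inside the validator method, the same literal could replace the string passed to `new RegExp(...)`.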

Problems with an ambiguous grammar and PEG.js (no examples found)

I want to parse a file with lines of the following content:
simple word abbr -8. (012) word, simple phrase, one another phrase - (simply dummy text of the printing; Lorem Ipsum : "Lorem" - has been the industry's standard dummy text, ever since the 1500s!; "It is a long established!"; "Sometimes by accident, sometimes on purpose (injected humour and the like)"; "sometimes on purpose") This is the end of the line
so now explaining the parts (not all spaces are described, because of the markup here):
simple word is one or several words (phrase) separated by the whitespace
abbr - is a fixed part of the string (never changes)
8 - optional number
. - always included
word, simple phrase, one another phrase - one or several words or phrases separated by comma
- ( - fixed part, always included
simply dummy text of the printing; Lorem Ipsum : "Lorem" - has been the industry's standard dummy text, ever since the 1500s!; - (optional) one or several phrases separated by ;
"It is a long established!"; "Sometimes by accident, sometimes on purpose (injected humour and the like)"; "sometimes on purpose" - (optional) one or several phrases with quotation marks "separated by ;
) This is the end of the line - always included
In the worst case there are no phrases in clause, but this is uncommon: there should be a phrase without augmenting quotation marks (phrase1 type) or with them (phrase2 type).
So the phrases are Natural Language sentences (with all the punctuation possible)...
BUT:
the internal content is irrelevant (i.e. I do not need to parse the Natural Language itself in the NLP meaning)
it is just required to mark it as a phrase1 or phrase2 types:
those without and with quotation marks, i.e., if the phrase, which is placed between ( and ; or ; and ; or ; and ) or even between ( and ) is augmented with quotation marks, then it is phrase2 type
otherwise, if the phrase either begins or ends without quotation marks, though it could have all the marks within the phrase, it is the phrase1 type
Since writing a regex (PCRE) for such input would be overkill, I looked at a parsing approach (EBNF or similar) and ended up with the PEG.js parser generator. I created some basic grammar variants (not yet handling the part with the different phrases in the clause):
start = term _ "abbr" _ "-" .+
term = word (_? word !(_ "abbr" _ "-"))+
word = letters:letter+ {return letters.join("")}
letter = [A-Za-z]
_ "whitespace"
= [ \t\n\r]*
or (the difference is only in " abbr -" and "_ "abbr" _ "-"" ):
start = term " abbr -" .+
term = word (_? word !(" abbr -"))+
word = letters:letter+ {return letters.join("")}
letter = [A-Za-z]
_ "whitespace"
= [ \t\n\r]*
But even this simple grammar cannot parse the beginning of the string. The errors are:
Parse Error Expected [A-Za-z] but " " found.
Parse Error Expected "abbr" but "-" found.
etc.
So it looks like the problem is the ambiguity: "abbr" is consumed within the term as a word token, even though I defined the rule !(" abbr -"), which I thought meant that the next word token is consumed only if the upcoming substring is not " abbr -".
I didn't find any good examples explaining the following PEG.js expressions, which seem to me a possible solution to this problem [from: http://pegjs.majda.cz/documentation]:
& expression
! expression
$ expression
& { predicate }
! { predicate }
TL;DR:
related to PEG.js:
are there any examples of applying the rules:
& expression
! expression
$ expression
& { predicate }
! { predicate }
general question:
What is a possible approach to handling such complex strings with intuitively ambiguous grammars? This is still not a natural language, and it looks like it has some formal structure, just with several optional parts. One idea is to split the strings in a preprocessing step (with regular expressions, at the fixed elements, i.e. "abbr -" and ") This is the end of the line") and then create a separate grammar for each split part. But this seems to have performance and scalability problems (e.g. what if the fixed elements change a bit and there is no - char anymore?).
Update1:
I found the rule which solves the problem with matching the "abbr -" ambiguity:
term = term:(word (!" abbr -" _? word))+ {return term.join("")}
but the result looks strange:
[
"simple, ,word",
" abbr -",
[
"8",
...
],
...
]
Without the action, i.e. with term = term:(word (!" abbr -" _? word))+, the result is:
[
[
"simple",
[
[
undefined,
[
" "
],
"word"
]
]
],
" abbr -",
[
"8",
".",
" ",
"(",
...
],
...
]
I expected something like:
[
[
"simple word"
],
" abbr -",
[
"8",
".",
" ",
"(",
...
],
...
]
or at least:
[
[
"simple",
[
" ",
"word"
]
],
" abbr -",
[
"8",
".",
" ",
"(",
...
],
...
]
The expression is grouped, so why is it separated into so many nesting levels, and why is undefined even included in the output? Are there any general rules for folding the result based on the expression in the rule?
Update2:
I created a grammar that parses as desired, though I haven't yet identified a clear process for creating such a grammar:
start
= (term:term1 (" abbr -" number "." _ "("number:number") "{return number}) terms:terms2 ((" - (" phrases:phrases ")" .+){return phrases}))
//start //alternative way = looks better
// = (term:term1 " abbr -" number "." _ "("number:number") " terms:terms2 " - (" phrases:phrases ")" .+){return {term: term, number: number, phrases:phrases}}
term1
= term1:(
start_word:word
(rest_words:(
rest_word:(
(non_abbr:!" abbr -"{return non_abbr;})
(space:_?{return space[0];}) word){return rest_word.join("");})+{return rest_words.join("")}
)) {return term1.join("");}
terms2
= terms2:(start_word:word (rest_words:(!" - (" ","?" "? word)+){rest_words = rest_words.map(function(array) {
return array.filter(function(n){return n != null;}).join("");
}); return start_word + rest_words.join("")})
phrases
// = ((phrase_t:(phrase / '"' phrase '"') ";"?" "?){return phrase_t})+
= (( (phrase:(phrase2 / phrase1) ";"?" "?) {return phrase;})+)
phrase2
= (('"'pharse2:(phrase)'"'){return {phrase2: pharse2}})
phrase1
= ((pharse1:phrase){return {phrase1: pharse1}})
phrase
= (general_phrase:(!(';' / ')' / '";' / '")') .)+ ){return general_phrase.map(function(array){return array[1]}).join("")}
word = letters:letter+ {return letters.join("")}
letter = [A-Za-z]
number = digits:digit+{return digits.join("")}
digit = [0-9]
_ "whitespace"
= [ \t\n\r]*
It can be tested either on the PEG.js author's site [http://pegjs.majda.cz/online] or in the PEG.js Web-IDE [http://peg.arcanis.fr/].
If somebody has answers to the previous questions (i.e. a general approach to disambiguating the grammar, examples of the expressions available in PEG.js) or advice on improving the grammar itself (I think it is still far from ideal), I would very much appreciate it!
You asked: "so why is it separated in so many nesting levels and even undefined is included in the output?"
If you look at the documentation for PEG.js, you'll see almost every operator collects the results of its operands into an array. undefined is returned by the ! operator.
The $ operator bypasses all this nesting and just gives you the actual string that matches, eg: [a-z]+ will give an array of letters, but $[a-z]+ will give a string of letters.
I think most of the parsing here follows the pattern: "give me everything until I see this string". You should express this in PEG by first using ! to make sure you haven't hit the terminating string, and then just taking the next character. For example, to get everything up to " abbr -":
(!" abbr -" .)+
If the terminating string is a single character, you can use [^] as a short form of this, eg: [^x]+ is a shorter way of saying (!"x" .)+.
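For readers more used to regular expressions, the "everything until I see this string" idiom has a close analogue: a negative lookahead checked before each consumed character. The sketch below is plain JavaScript regex, not PEG.js, and is only meant as an analogy. The sample line is a shortened version of the input from the question.

```javascript
const line = "simple word abbr -8. (012) word, simple phrase";

// (?! abbr -) plays the role of !" abbr -", and [\s\S] plays the role of "." in PEG.js:
// at each position, check that the terminator is not next, then take one character.
const untilAbbr = /^(?:(?! abbr -)[\s\S])+/;

const head = untilAbbr.exec(line)[0];
console.log(head); // simple word
```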
Parsing comma/semicolon separated phrases rather than comma/semicolon terminated phrases is slightly annoying, but treating them as optional terminators seems to work (with some trimming).
start = $(!" abbr -" .)+ " abbr -" $num "." [ ]? "(012)"
phrase_comma+ "- (" noq_phrase_semi+ q_phrase_semi+ ")"
$.*
phrase_comma = p:$[^-,]+ [, ]* { return p.trim() }
noq_phrase_semi = !'"' p:$[^;]+ [; ]* { return p.trim() }
q_phrase_semi = '"' p:$[^"]+ '"' [; ]* { return p }
num = [0-9]+
gives
[
"simple word",
" abbr -",
"8",
".",
" ",
"(012)",
[
"word",
"simple phrase",
"one another phrase"
],
"- (",
[
"simply dummy text of the printing",
"Lorem Ipsum : \"Lorem\" - has been the industry's standard dummy text, ever since the 1500s!"
],
[
"It is a long established!",
"Sometimes by accident, sometimes on purpose (injected humour and the like)",
"sometimes on purpose"
],
")",
" This is the end of the line"
]