regex to replace specific characters while capturing the rest of the line - regex

Using notepad++
I need to replace ", " with , on a line beginning exclusively with genre: and nowhere else in the document, while maintaining all of the other content in the line. I will be applying the search/replace to an entire folder, so I need to be as precise as I can.
Examples
genre: "drama", "thriller", "mystery", "espionage"
genre: "drama", "sci-fi"
should look like this:
genre: "drama, thriller, mystery, espionage"
genre: "drama, sci-fi"
I'm having a hell of a time figuring out how to do that without capturing an unlimited and unknown number of groups before and after each instance of ", ", while also keeping the first word and colon: genre: . I'm pretty sure I have to capture the entire group between the first and last ", and then replace ", " with just , within that group, but I can't figure out how to do that.
Obviously what I have here isn't going to do the trick.
find what: ^genre: "(.*)", "(.*)", "(.*)", "(.*)"
replace with: genre: "$1, $2, $3, $4"

You can use
Find: (?:\G(?!^)|^genre:\h*").*?\K",\h*"
Replace: ,<SPACE>
Details:
(?:\G(?!^)|^genre:\h*") - end of the previous match position or genre:, zero or more horizontal whitespaces and " at the start of string (here, line)
.*? - any zero or more chars other than line break chars, as few as possible
\K - omit the matched text
",\h*" - consume ",, then zero or more horizontal whitespaces, and then a " (this will be replaced with , + space)
See the regex demo:
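For comparison, the same edit can be sketched outside Notepad++ in Python. Python's re module has no \G, \K or \h, so this sketch swaps in a different technique: it matches each whole genre: line and fixes the separators in a callback. The names GENRE_LINE, SEPARATOR and merge_genres are made up for illustration, and the sketch assumes "\n" line endings.

```python
import re

# Whole "genre:" line: prefix up to the first quote, then everything up to the last quote.
GENRE_LINE = re.compile(r'^(genre:[ \t]*")(.*)"[ \t]*$', re.MULTILINE)
# The '", "' separator between two quoted items.
SEPARATOR = re.compile(r'",[ \t]*"')

def merge_genres(text: str) -> str:
    """Collapse the quoted items on genre: lines into one quoted list."""
    def fix(match: re.Match) -> str:
        inner = SEPARATOR.sub(", ", match.group(2))
        return match.group(1) + inner + '"'
    return GENRE_LINE.sub(fix, text)

sample = (
    'genre: "drama", "thriller", "mystery", "espionage"\n'
    'tags: "a", "b"\n'
    'genre: "drama", "sci-fi"\n'
)
print(merge_genres(sample))
```

Lines that do not start with genre: (the tags: line here) are left untouched.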

Try this (updated answer):
Find: (?:^genre|\G)(?!^).*?\K", "
Replace All: ,<SPACE> (a comma followed by a space)

Related

Regexp to cut all text inside the external quotation signs

Please help me with adjusting regexp. I need to cut all text inside the external quotation signs.
I have text:
some text "have "some text" here "that should" be cut"
My regexp:
some text "(?<name>[^"]*)"
Need to get
have "some text" here "that should" be cut
But I've got
have
If you want to support the first level of nested double quotes you can use
some text "(?<name>[^"]*(?:"[^"]*"[^"]*)*)"
See the regex demo.
Details:
[^"]* - zero or more chars other than double quotes
(?:"[^"]*"[^"]*)* - zero or more repetitions of
"[^"]*" - a substring between double quotes that contains no other double quotes
[^"]* - zero or more chars other than double quotes.
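A quick way to check the one-level-of-nesting pattern is to run it in Python. This is only a sketch: Python's re module spells named groups (?P<name>...) rather than (?<name>...), so the group syntax is adapted, and the variable names are made up for illustration.

```python
import re

# Same pattern as above, with Python's (?P<name>...) named-group syntax.
pattern = re.compile(r'some text "(?P<name>[^"]*(?:"[^"]*"[^"]*)*)"')

text = 'some text "have "some text" here "that should" be cut"'
match = pattern.search(text)
inner = match.group("name") if match else None
print(inner)
```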
If your regex flavor supports recursion:
some text ("(?<name>(?:[^"]++|\g<1>)*)")
See this regex demo. Here, ("(?<name>(?:[^"]++|\g<1>)*)") is a capturing group #1 that matches
" - a " char
(?<name>(?:[^"]++|\g<1>)*) - Group "name": zero or more sequences of
[^"]++ - one or more chars other than "
| - or
\g<1> - Group 1 pattern recursed
" - a " char
Assuming you want to remove all text up to the first quotes then retain everything till the last quote, you can try this.
Demo
[[:alpha:]][^"]*\"(?<name>.*)"
You can solve this problem with nested regexp operators:
SELECT regexp_replace(Regexp_substr(regexp_replace(word,'(^")|("$)'),'["].+'),'(^")') as Result
from(
SELECT '"some text "have "some text" here "that should" be cut"' as word from dual)

Notepad++: replace occurrences of characters before another character

I have a file with text like this:
"Title" = "Body"
And I would like to remove both " before the =, to leave it like this:
Title = "Body"
So far I managed to select the first block of text with:
.+(=)
That selects everything up to the =, but I can't find how to replace (or delete) both " .
Any suggestions?
You could use a capture group in the replacement, and match the double quotes to be removed while asserting an equals sign at the right.
Find what:
"([^"]+)"(?=\h*=)
" Match literally
([^"]+) Capture group 1, match 1+ times any char other than "
" Match literally
(?=\h*=) Positive lookahead, assert an = sign at the right
Regex demo
Replace with:
$1
To match the whole pattern from the start to the end of the string, you might also use 2 capture groups and use those in the replacement.
^"([^"]+)"(\h*=\h*"[^"]+")$
Regex demo
In the replacement use $1$2
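The lookahead variant can be tried in Python as a rough sketch. Python's re has no \h, so [ \t] stands in for it here, and the function name unquote_keys is made up for illustration.

```python
import re

# Quoted text that is followed by optional spaces/tabs and an equals sign.
QUOTED_BEFORE_EQUALS = re.compile(r'"([^"]+)"(?=[ \t]*=)')

def unquote_keys(text: str) -> str:
    """Drop the double quotes around the part before the = sign."""
    return QUOTED_BEFORE_EQUALS.sub(r"\1", text)

print(unquote_keys('"Title" = "Body"'))
```

The quoted value after the = is not followed by another =, so the lookahead fails there and those quotes stay.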
You can use
(?:\G(?!^)|^(?=.*=))[^"=\v]*\K"
Replace with an empty string.
Details:
(?:\G(?!^)|^(?=.*=)) - end of the previous successful match (\G(?!^)) or (|) start of a line that contains = somewhere on it (^(?=.*=))
[^"=\v]* - any zero or more chars other than ", = and vertical whitespace
\K - omit the text matched
" - a " char (matched, consumed and removed)

How to add "backspace character" to regex output change in vscode?

I have a file of roughly 1,400 lines. Each line holds some information, and the line after it holds more information that I want to move up onto the previous line (the one with the text).
I tried replacing " for" with "\r |", which was the only thing I could think of at the time.
For example here it's "structure" of my file:
T="topic 1"
for xxx#xxx.com
T="topic 2"
for yyy#yyy.com
I want to turn that into this:
T="topic 1" | for xxx#xxx.com
T="topic 2" | for yyy#yyy.com
You may use
Find what: \n( for)\b
Replace with: <SPACE>|$1 (a space, then a pipe, then Group 1, so the result reads "topic 1" | for)
Details
\n - a line break
( for) - Capturing group 1 ($1): a space and for
\b - word boundary.
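The same find/replace can be sketched in Python to see what it does to the sample. This assumes, as the question implies, that each continuation line starts with a space before for, and that the file uses "\n" line endings; the function name join_for_lines is made up for illustration.

```python
import re

def join_for_lines(text: str) -> str:
    """Pull each ' for ...' line up onto the previous line, separated by ' |'."""
    # \n( for)\b : a line break, then a captured ' for', then a word boundary.
    return re.sub(r"\n( for)\b", r" |\1", text)

sample = 'T="topic 1"\n for xxx#xxx.com\nT="topic 2"\n for yyy#yyy.com'
print(join_for_lines(sample))
```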
Another option, if you don't want to keep for, could be to match:
\n[ \t]+for[ ]
That will match:
\n Match a line break
[ \t]+ Match 1+ times a space or tab char (or just a single space if that is the case)
for[ ] Match for followed by a space (the square brackets are for clarity only)
And replace with a space, a pipe followed by a space
|
Regex demo

Extract only ALLCAPS words with regex

Looking for a way to extract only words that are in ALL CAPS from a text string. The catch is that it shouldn't extract other words in the text string that are mixed case.
For example, how do I use regex to extract KENTUCKY from the following sentence:
There Are Many Options in KENTUCKY
I'm trying to do this using regexextract() in Google Sheets, which uses RE2.
Looking forward to hearing your thoughts.
Pretending that your text is in cell A2:
If there is only one instance in each text segment this will work:
=REGEXEXTRACT(A2,"([A-Z]{2,})")
If there are multiple instances in a single text segment then use this; it will dynamically adjust the regex to extract every occurrence for you:
=REGEXEXTRACT(A2, REPT(".* ([A-Z]{2,})", COUNTA(SPLIT(REGEXREPLACE(A2,"([A-Z]{2,})","$"),"$"))-1))
If you need to extract whole chunks of words in ALLCAPS, use
=REGEXEXTRACT(A2,"\b[A-Z]+(?:\s+[A-Z]+)*\b")
=REGEXEXTRACT(A2,"\b\p{Lu}+(?:\s+\p{Lu}+)*\b")
See this regex demo.
Details
\b - word boundary
[A-Z]+ - 1+ ASCII uppercase letters (\p{Lu} matches any Unicode uppercase letter, e.g. Cyrillic or Greek capitals)
(?:\s+[A-Z]+)* - zero or more repetitions of
\s+ - 1+ whitespaces
[A-Z]+ - 1+ ASCII uppercase letters (or any Unicode uppercase letters with \p{Lu})
\b - word boundary.
Or, if you allow any punctuations or symbols between uppercase letters you may use
=REGEXEXTRACT(A2,"\b[A-Z]+(?:[^a-zA-Z0-9]+[A-Z]+)*\b")
=REGEXEXTRACT(A2,"\b\p{Lu}+(?:[^\p{L}\p{N}]+\p{Lu}+)*\b")
See the regex demo.
Here, [^a-zA-Z0-9]+ matches one or more chars other than ASCII letters and digits, and [^\p{L}\p{N}]+ matches any one or more chars other than any Unicode letters and digits.
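To see how the ASCII "chunks of ALLCAPS words" pattern behaves outside Sheets, here is a small Python sketch. Python's built-in re module does not support \p{Lu}, so only the [A-Z] variant is shown; the function name allcaps_chunks is made up for illustration. Note that a single capital such as I or A also counts as an all-caps word under this pattern.

```python
import re

# One or more all-caps words separated by whitespace, bounded by word boundaries.
ALLCAPS_CHUNK = re.compile(r"\b[A-Z]+(?:\s+[A-Z]+)*\b")

def allcaps_chunks(text: str) -> list:
    """Return every run of consecutive ALLCAPS words found in text."""
    return ALLCAPS_CHUNK.findall(text)

print(allcaps_chunks("There Are Many Options in KENTUCKY"))
print(allcaps_chunks("Flights to NEW YORK are cheap"))
```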
This should work:
\b[A-Z]+\b
See demo
2nd EDIT ALL CAPS / UPPERCASE solution:
I finally arrived at this simpler approach with help from other solutions:
=trim(regexreplace(regexreplace(C15,"(?:([A-Z]{2,}))|.", " $1"), "(\s)([A-Z])","$1 $2"))
From this input:
isn'ter JOHN isn'tar DOE isn'ta or JANE
It returns this output:
JOHN DOE JANE
The same for Title Case (extracting all capitalized words, i.e. words whose first letter is uppercase):
Formula:
=trim(regexreplace(regexreplace(C1,"(?:([A-Z]([a-z]){1,}))|.", " $1"), "(\s)([A-Z])","$1 $2"))
Input in C1:
The friendly Quick Brown Fox from the woods Jumps Over the Lazy Dog from the farm.
Output in A1:
The Quick Brown Fox Jumps Over Lazy Dog
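As a cross-check of that expected output, the Title Case extraction can be sketched in Python. This is not the Sheets formula above: it swaps in a plain findall over a word-boundary pattern, and the function name title_case_words is made up for illustration.

```python
import re

# A capital letter followed by one or more lowercase letters, as a whole word.
TITLE_WORD = re.compile(r"\b[A-Z][a-z]+\b")

def title_case_words(text: str) -> str:
    """Return the Title Case words of text, joined by single spaces."""
    return " ".join(TITLE_WORD.findall(text))

sentence = "The friendly Quick Brown Fox from the woods Jumps Over the Lazy Dog from the farm."
print(title_case_words(sentence))
```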
Previous less efficient trials :
I had to custom tailor it that way for my use case:
= ArrayFormula(IF(REGEXMATCH(REGEXREPLACE(N3: N,
"(^[A-Z]).+(,).+(\s[a-z]\s)|(^[A-Z][a-z]).+(\s[a-z][a-z]\s)|(^[A-Z]\s).+(\.\s[A-Z][a-z][a-z]\s)|[A-Z][a-z].+[0-9]|[A-Z][a-z].+[0-9]+|(^[A-Z]).+(\s[A-Z]$)|(^[A-Z]).+(\s[A-Z][a-z]).+(\s[A-Z])|(\s[A-Z][a-z]).+(\s[A-Z]\s).+(\s[A-Z])|(^[A-Z][a-z]).+(\s[A-Z]$)|(\s[A-Z]\s).+(\s[A-Z]\s)|(\s[A-Z]\s)|^[A-Z].+\s[A-Z]((\?)|(\!)|(\.)|(\.\.\.))|^[A-Z]'|^[A-Z]\s|\s[A-Z]'|[A-Z][a-z]|[a-z]{1,}|(^.+\s[A-Z]$)|(\.)|(-)|(--)|(\?)|(\!)|(,)|(\.\.\.)|(\()|(\))|(\')|("
")|(“)|(”)|(«)|(»)|(‘)|(’)|(<)|(>)|(\{)|(\})|(\[)|(\])|(;)|(:)|(#)|(#)|(\*)|(¦)|(\+)|(%)|(¬)|(&)|(|)|(¢)|($)|(£)|(`)|(^)|(€)|[0-9]|[0-9]+",
""), "[A-Z]{2,}") = FALSE, "", REGEXREPLACE(N3: N,
"(^[A-Z]).+(,).+(\s[a-z]\s)|(^[A-Z][a-z]).+(\s[a-z][a-z]\s)|(^[A-Z]\s).+(\.\s[A-Z][a-z][a-z]\s)|[A-Z][a-z].+[0-9]|[A-Z][a-z].+[0-9]+|(^[A-Z]).+(\s[A-Z]$)|(^[A-Z]).+(\s[A-Z][a-z]).+(\s[A-Z])|(\s[A-Z][a-z]).+(\s[A-Z]\s).+(\s[A-Z])|(^[A-Z][a-z]).+(\s[A-Z]$)|(\s[A-Z]\s).+(\s[A-Z]\s)|(\s[A-Z]\s)|^[A-Z].+\s[A-Z]((\?)|(\!)|(\.)|(\.\.\.))|^[A-Z]'|^[A-Z]\s|\s[A-Z]'|[A-Z][a-z]|[a-z]{1,}|(^.+\s[A-Z]$)|(\.)|(-)|(--)|(\?)|(\!)|(,)|(\.\.\.)|(\()|(\))|(\')|("
")|(“)|(”)|(«)|(»)|(‘)|(’)|(<)|(>)|(\{)|(\})|(\[)|(\])|(;)|(:)|(#)|(#)|(\*)|(¦)|(\+)|(%)|(¬)|(&)|(|)|(¢)|($)|(£)|(`)|(^)|(€)|[0-9]|[0-9]+",
"")))
I went through the exceptions one by one and added a regex alternative for each to the front of the pipe-separated pattern.
@Wiktor Stribiżew any simplifying suggestions would be very welcome.
Found some missing and fixed them.
1st EDIT:
A simpler version though still quite lengthy:
= ArrayFormula(IF(REGEXMATCH(REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(
REGEXREPLACE(REGEXREPLACE(P3: P, "[a-z,]",
" "), "-|\.", " "), "(^[A-Z]\s)", " "
), "(\s[A-Z]\s)", " "),
"\sI'|\sI\s|^I'|^I\s|\sI(\.|\?|\!)|\sI$|\sA\s|^A\s|\.\.\.|\.|-|--|,|\?|\!|\.|\(|\)|'|"
"|:|;|\'|“|”|«|»|‘|’|<|>|\{|\}|\[|\]|#|#|\*|¦|\+|%|¬|&|\||¢|$|£|`|^|€|[0-9]|[0-9]+",
" "), "[A-Z]{2,}") = FALSE, " ", REGEXREPLACE(
REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(
P3: P, "[a-z,]", " "), "-|\.", " "),
"(^[A-Z]\s)", " "), "(\s[A-Z]\s)", " "),
"\sI'|\sI\s|^I'|^I\s|\sI(\.|\?|\!)|\sI$|\sA\s|^A\s|\.\.\.|\.|-|--|,|\?|\!|\.|\(|\)|'|"
"|:|;|\'|“|”|«|»|‘|’|<|>|\{|\}|\[|\]|#|#|\*|¦|\+|%|¬|&|\||¢|$|£|`|^|€|[0-9]|[0-9]+",
" ")))
From this example:
Multiple regex matches in Google Sheets formula

regex - capture group

I am trying to write a regex to match the following at the beginning of a new line
- a number followed by a closing parenthesis, e.g. 2) or 8)
- a number followed by a period, e.g. 5.
- the character '-'
- the character '*'
the following strings should match
"1. Sorting function. If you have a long checklist it's very difficult."
"5) This is another example"
"-this is yet another one"
"* last item in the list"
I have tried this but it doesn't quite get me what I am looking for.
re.findall(r'(?m)\s*^[-*(\d.)(\d\))]',item)
Try
re.findall(r'^\s*(\d+(\)|\.)|-|\*)', item, re.MULTILINE)
It will match all sequences of numbers followed by a closing parenthesis or period as well as dashes and stars at the beginning of the line.
Example: https://regex101.com/r/cR2lZ5/6
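One thing to watch with re.findall is that a pattern with capturing groups returns tuples of groups rather than the matched text. The sketch below is a variant, not the answer's exact pattern: it uses a non-capturing group so findall returns the markers themselves, and [ \t]* instead of \s* so leading whitespace cannot run across line breaks. The names LIST_MARKER and list_marker are made up for illustration.

```python
import re
from typing import Optional

# Optional indentation, then "<digits>." or "<digits>)" or a single "-" or "*".
LIST_MARKER = re.compile(r"^[ \t]*(?:\d+[.)]|[-*])", re.MULTILINE)

def list_marker(line: str) -> Optional[str]:
    """Return the list marker at the start of line, or None if there is none."""
    m = LIST_MARKER.match(line)
    return m.group(0).strip() if m else None

items = [
    "1. Sorting function. If you have a long checklist it's very difficult.",
    "5) This is another example",
    "-this is yet another one",
    "* last item in the list",
]
print([list_marker(item) for item in items])

# With re.MULTILINE the same pattern also works on a multi-line block.
block = "1. a\n5) b\n-c\n* d\nplain text"
print(LIST_MARKER.findall(block))
```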
Assuming that your quote marks " are not included, and that each line is a separate string,
^\d\.|^\d\)|^\-|^\*
Would be the regular expression. | is OR, \d is a digit, and you escape the special characters ".", ")", "-", and "*" by putting a backslash in front of them.
You can test your regular expressions here. Good luck!