Please help me adjust my regexp. I need to extract all the text inside the outermost quotation marks.
I have text:
some text "have "some text" here "that should" be cut"
My regexp:
some text "(?<name>[^"]*)"
Need to get
have "some text" here "that should" be cut
But I've got
have
If you want to support the first level of nested double quotes, you can use
some text "(?<name>[^"]*(?:"[^"]*"[^"]*)*)"
See the regex demo.
Details:
[^"]* - zero or more chars other than double quotes
(?:"[^"]*"[^"]*)* - zero or more repetitions of
"[^"]*" - a substring between double quotes that contains no other double quotes
[^"]* - zero or more chars other than double quotes.
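As a quick check of the pattern above, here is a minimal Python sketch. It assumes Python's built-in re module, which spells named groups (?P<name>...) rather than (?<name>...); the function name extract_outer is made up for illustration.

```python
import re

# Same pattern as above, written with Python's (?P<name>...) named-group syntax.
# It handles only ONE level of nested double quotes.
PATTERN = re.compile(r'some text "(?P<name>[^"]*(?:"[^"]*"[^"]*)*)"')

def extract_outer(text):
    """Return the text between the outermost quotes, or None if nothing matches."""
    match = PATTERN.search(text)
    return match.group("name") if match else None

sample = 'some text "have "some text" here "that should" be cut"'
print(extract_outer(sample))  # have "some text" here "that should" be cut
```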
If your regex flavor supports recursion:
some text ("(?<name>(?:[^"]++|\g<1>)*)")
See this regex demo. Here, ("(?<name>(?:[^"]++|\g<1>)*)") is a capturing group #1 that matches
" - a " char
(?<name>(?:[^"]++|\g<1>)*) - Group "name": zero or more sequences of
[^"]++ - one or more chars other than "
| - or
\g<1> - Group 1 pattern recursed
" - a " char
Assuming you want to remove all text up to the first quotes then retain everything till the last quote, you can try this.
Demo
[[:alpha:]][^"]*\"(?<name>.*)"
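A rough Python version of the greedy approach above, as a sketch only. Python's built-in re module has no POSIX classes such as [[:alpha:]], so [A-Za-z] is swapped in here (an assumption that the text starts with ASCII letters); the function name is hypothetical.

```python
import re

# [A-Za-z] stands in for [[:alpha:]], which Python's re does not support.
# The greedy .* runs to the end of the line and backtracks to the LAST quote.
PATTERN = re.compile(r'[A-Za-z][^"]*"(?P<name>.*)"')

def between_first_and_last_quote(text):
    """Return everything between the first and the last double quote."""
    match = PATTERN.search(text)
    return match.group("name") if match else None

sample = 'some text "have "some text" here "that should" be cut"'
print(between_first_and_last_quote(sample))
```

Because .* is greedy, this keeps any number of inner quotes, but it will also swallow unrelated quoted text further along the same line.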
You can solve this problem with nested regexp operators:
SELECT regexp_replace(Regexp_substr(regexp_replace(word,'(^")|("$)'),'["].+'),'(^")') as Result
from(
SELECT '"some text "have "some text" here "that should" be cut"' as word from dual)
I have a file with text like this:
"Title" = "Body"
And I would like to remove both " before the =, to leave it like this:
Title = "Body"
So far I managed to select the first block of text with:
.+(=)
That selects everything up to the =, but I can't find how to replace (or delete) both " .
Any suggestions?
You could use a capture group in the replacement, and match the double quotes to be removed while asserting an equals sign at the right.
Find what:
"([^"]+)"(?=\h*=)
" Match literally
([^"]+) Capture group 1, match 1+ times any char other than "
" Match literally
(?=\h*=) Positive lookahead, assert an = sign at the right
Regex demo
Replace with:
$1
To match the whole pattern from the start to the end of the string, you might also use 2 capture groups and use those in the replacement.
^"([^"]+)"(\h*=\h*"[^"]+")$
Regex demo
In the replacement use $1$2
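For anyone scripting this instead of using an editor, a minimal Python sketch of the lookahead version follows. Two things differ from the pattern above and are assumptions of this sketch: Python's re has no \h, so [ \t]* is used instead, and the replacement is written \1 rather than $1. The function name unquote_key is made up.

```python
import re

# [ \t]* replaces \h* (horizontal whitespace), which Python's re lacks.
QUOTED_BEFORE_EQUALS = re.compile(r'"([^"]+)"(?=[ \t]*=)')

def unquote_key(line):
    """Drop the quotes around a quoted token that is followed by '='."""
    return QUOTED_BEFORE_EQUALS.sub(r"\1", line)

print(unquote_key('"Title" = "Body"'))  # Title = "Body"
```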
You can use
(?:\G(?!^)|^(?=.*=))[^"=\v]*\K"
Replace with an empty string.
Details:
(?:\G(?!^)|^(?=.*=)) - end of the previous successful match (\G(?!^)) or (|) start of a line that contains = somewhere on it (^(?=.*=))
[^"=\v]* - any zero or more chars other than ", = and vertical whitespace
\K - omit the text matched
" - a " char (matched, consumed and removed)
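Python's built-in re module supports neither \G nor \K, so the pattern above cannot be ported as-is. The sketch below gets the same effect with plain string methods instead of a regex: on every line that contains =, it removes the double quotes that come before the first =. The function name is hypothetical.

```python
def strip_quotes_before_equals(text):
    """Remove every double quote that appears before the first '=' on a line."""
    result = []
    for line in text.split("\n"):
        left, sep, right = line.partition("=")
        if sep:  # the line contains '=': clean only the part before it
            left = left.replace('"', "")
        result.append(left + sep + right)
    return "\n".join(result)

print(strip_quotes_before_equals('"Title" = "Body"'))  # Title = "Body"
```

Lines without an = are left untouched, which mirrors the ^(?=.*=) condition in the regex.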
I have a file containing about 1400 lines. Each line holds some information, and the line after it holds more information that I want to move up onto the previous line (the one with the text).
I tried replacing " for" with "\r |", which was the only thing that came to mind at the time.
For example here it's "structure" of my file:
T="topic 1"
for xxx#xxx.com
T="topic 2"
for yyy#yyy.com
I want to clean that up into this:
T="topic 1" | for xxx#xxx.com
T="topic 2" | for yyy#yyy.com
You may use
Find what: \n( for)\b
Replace with: " |$1" (note the leading space before the pipe)
Details
\n - a line break
( for) - Capturing group 1 ($1): a space and for
\b - word boundary.
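The same replacement can be sketched in Python, assuming the continuation lines start with exactly one space followed by for, as in the sample. The replacement string begins with a space so the result matches the requested format, \1 is Python's spelling of $1, and \r?\n is an addition of this sketch to cope with Windows line endings. The function name is made up.

```python
import re

def join_for_lines(text):
    # \r?\n also covers Windows line endings (an addition to the plain \n above).
    # Group 1 keeps " for"; the replacement puts " |" in front of it.
    return re.sub(r'\r?\n( for)\b', r' |\1', text)

sample = 'T="topic 1"\n for xxx#xxx.com\nT="topic 2"\n for yyy#yyy.com'
print(join_for_lines(sample))
```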
Another option, if you don't want to keep for, could be to match:
\n[ \t]+for[ ]
That will match:
\n Match a line break
[ \t]+ Match 1+ times a space or tab char (or just a single space if that is the case)
for[ ] Match for followed by a space (the square brackets are for clarity only)
Then replace with a space, a pipe, and a space:
|
Regex demo
Looking for a way to extract only words that are in ALL CAPS from a text string. The catch is that it shouldn't extract other words in the text string that are mixed case.
For example, how do I use regex to extract KENTUCKY from the following sentence:
There Are Many Options in KENTUCKY
I'm trying to do this using regexextract() in Google Sheets, which uses RE2.
Looking forward to hearing your thoughts.
Pretending that your text is in cell A2:
If there is only one instance in each text segment this will work:
=REGEXEXTRACT(A2,"([A-Z]{2,})")
If there are multiple instances in a single text segment then use this; it will dynamically adjust the regex to extract every occurrence for you:
=REGEXEXTRACT(A2, REPT(".* ([A-Z]{2,})", COUNTA(SPLIT(REGEXREPLACE(A2,"([A-Z]{2,})","$"),"$"))-1))
If you need to extract whole chunks of words in ALLCAPS, use
=REGEXEXTRACT(A2,"\b[A-Z]+(?:\s+[A-Z]+)*\b")
=REGEXEXTRACT(A2,"\b\p{Lu}+(?:\s+\p{Lu}+)*\b")
See this regex demo.
Details
\b - word boundary
[A-Z]+ - 1+ ASCII uppercase letters (\p{Lu} matches any Unicode uppercase letter, including Cyrillic, Greek, etc.)
(?:\s+[A-Z]+)* - zero or more repetitions of
\s+ - 1+ whitespaces
[A-Z]+ - 1+ ASCII uppercase letters (\p{Lu} matches any Unicode uppercase letter)
\b - word boundary.
Or, if you allow any punctuation or symbols between the uppercase words, you may use
=REGEXEXTRACT(A2,"\b[A-Z]+(?:[^a-zA-Z0-9]+[A-Z]+)*\b")
=REGEXEXTRACT(A2,"\b\p{Lu}+(?:[^\p{L}\p{N}]+\p{Lu}+)*\b")
See the regex demo.
Here, [^a-zA-Z0-9]+ matches one or more chars other than ASCII letters and digits, and [^\p{L}\p{N}]+ matches any one or more chars other than any Unicode letters and digits.
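Outside of Sheets, the ASCII patterns can be tried in Python as a minimal sketch. Python's built-in re has no \p{Lu}, so only the [A-Z] variants are shown. CAPS_WORD (\b[A-Z]{2,}\b) is a combination of patterns from this thread rather than one quoted verbatim, and all names here are hypothetical.

```python
import re

# ASCII-only versions; Python's built-in re has no \p{Lu}.
CAPS_CHUNK = re.compile(r'\b[A-Z]+(?:\s+[A-Z]+)*\b')  # runs of ALL-CAPS words
CAPS_WORD = re.compile(r'\b[A-Z]{2,}\b')              # single words, 2+ letters

def caps_chunks(text):
    """Return whole chunks of consecutive ALL-CAPS words."""
    return CAPS_CHUNK.findall(text)

def caps_words(text):
    """Return each ALL-CAPS word of two or more letters separately."""
    return CAPS_WORD.findall(text)

print(caps_chunks("There Are Many Options in KENTUCKY"))   # ['KENTUCKY']
print(caps_chunks("Visit NEW YORK and LOS ANGELES soon"))  # ['NEW YORK', 'LOS ANGELES']
```

The word boundaries are what keep the T in There from matching: there is no boundary between T and h.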
This should work:
\b[A-Z]+\b
See demo
2nd EDIT ALL CAPS / UPPERCASE solution:
Finally got this simpler way, with help from the other solutions posted here:
=trim(regexreplace(regexreplace(C15,"(?:([A-Z]{2,}))|.", " $1"), "(\s)([A-Z])","$1 $2"))
From this input:
isn'ter JOHN isn'tar DOE isn'ta or JANE
It returns this output:
JOHN DOE JANE
The same for Title Case (extracting all capitalized words, i.e. words whose 1st letter is uppercase):
Formula:
=trim(regexreplace(regexreplace(C1,"(?:([A-Z]([a-z]){1,}))|.", " $1"), "(\s)([A-Z])","$1 $2"))
Input in C1:
The friendly Quick Brown Fox from the woods Jumps Over the Lazy Dog from the farm.
Output in A1:
The Quick Brown Fox Jumps Over Lazy Dog
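The "capture what you want, or match any other character" trick used in both formulas can be sketched in Python as follows. This is an illustration under assumptions, not a port of the Sheets formulas: the helper name keep_only is made up, and the whitespace clean-up is done with split/join instead of the second regexreplace plus trim.

```python
import re

def keep_only(pattern, text):
    """Keep only the words matching `pattern`, separated by single spaces."""
    # Alternation: either capture a wanted word (group 1) or match any other
    # single character. Wanted words are kept, everything else becomes a space.
    blanked = re.sub(
        r'(' + pattern + r')|.',
        lambda m: ' ' + (m.group(1) or ''),
        text,
    )
    return ' '.join(blanked.split())  # collapse runs of whitespace

print(keep_only(r'[A-Z]{2,}', "isn'ter JOHN isn'tar DOE isn'ta or JANE"))
# JOHN DOE JANE
```

Passing [A-Z][a-z]+ instead of [A-Z]{2,} gives the Title Case variant.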
Previous less efficient trials :
I had to custom tailor it that way for my use case:
= ArrayFormula(IF(REGEXMATCH(REGEXREPLACE(N3: N,
"(^[A-Z]).+(,).+(\s[a-z]\s)|(^[A-Z][a-z]).+(\s[a-z][a-z]\s)|(^[A-Z]\s).+(\.\s[A-Z][a-z][a-z]\s)|[A-Z][a-z].+[0-9]|[A-Z][a-z].+[0-9]+|(^[A-Z]).+(\s[A-Z]$)|(^[A-Z]).+(\s[A-Z][a-z]).+(\s[A-Z])|(\s[A-Z][a-z]).+(\s[A-Z]\s).+(\s[A-Z])|(^[A-Z][a-z]).+(\s[A-Z]$)|(\s[A-Z]\s).+(\s[A-Z]\s)|(\s[A-Z]\s)|^[A-Z].+\s[A-Z]((\?)|(\!)|(\.)|(\.\.\.))|^[A-Z]'|^[A-Z]\s|\s[A-Z]'|[A-Z][a-z]|[a-z]{1,}|(^.+\s[A-Z]$)|(\.)|(-)|(--)|(\?)|(\!)|(,)|(\.\.\.)|(\()|(\))|(\')|("
")|(“)|(”)|(«)|(»)|(‘)|(’)|(<)|(>)|(\{)|(\})|(\[)|(\])|(;)|(:)|(#)|(#)|(\*)|(¦)|(\+)|(%)|(¬)|(&)|(|)|(¢)|($)|(£)|(`)|(^)|(€)|[0-9]|[0-9]+",
""), "[A-Z]{2,}") = FALSE, "", REGEXREPLACE(N3: N,
"(^[A-Z]).+(,).+(\s[a-z]\s)|(^[A-Z][a-z]).+(\s[a-z][a-z]\s)|(^[A-Z]\s).+(\.\s[A-Z][a-z][a-z]\s)|[A-Z][a-z].+[0-9]|[A-Z][a-z].+[0-9]+|(^[A-Z]).+(\s[A-Z]$)|(^[A-Z]).+(\s[A-Z][a-z]).+(\s[A-Z])|(\s[A-Z][a-z]).+(\s[A-Z]\s).+(\s[A-Z])|(^[A-Z][a-z]).+(\s[A-Z]$)|(\s[A-Z]\s).+(\s[A-Z]\s)|(\s[A-Z]\s)|^[A-Z].+\s[A-Z]((\?)|(\!)|(\.)|(\.\.\.))|^[A-Z]'|^[A-Z]\s|\s[A-Z]'|[A-Z][a-z]|[a-z]{1,}|(^.+\s[A-Z]$)|(\.)|(-)|(--)|(\?)|(\!)|(,)|(\.\.\.)|(\()|(\))|(\')|("
")|(“)|(”)|(«)|(»)|(‘)|(’)|(<)|(>)|(\{)|(\})|(\[)|(\])|(;)|(:)|(#)|(#)|(\*)|(¦)|(\+)|(%)|(¬)|(&)|(|)|(¢)|($)|(£)|(`)|(^)|(€)|[0-9]|[0-9]+",
"")))
Going one by one over all exceptions and adding their respective regex formulations to the front of the multiple pipes separated regexes in the regexextract function.
@Wiktor Stribiżew any simplifying suggestions would be very welcome.
Found some missing and fixed them.
1st EDIT:
A simpler version though still quite lengthy:
= ArrayFormula(IF(REGEXMATCH(REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(
REGEXREPLACE(REGEXREPLACE(P3: P, "[a-z,]",
" "), "-|\.", " "), "(^[A-Z]\s)", " "
), "(\s[A-Z]\s)", " "),
"\sI'|\sI\s|^I'|^I\s|\sI(\.|\?|\!)|\sI$|\sA\s|^A\s|\.\.\.|\.|-|--|,|\?|\!|\.|\(|\)|'|"
"|:|;|\'|“|”|«|»|‘|’|<|>|\{|\}|\[|\]|#|#|\*|¦|\+|%|¬|&|\||¢|$|£|`|^|€|[0-9]|[0-9]+",
" "), "[A-Z]{2,}") = FALSE, " ", REGEXREPLACE(
REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(REGEXREPLACE(
P3: P, "[a-z,]", " "), "-|\.", " "),
"(^[A-Z]\s)", " "), "(\s[A-Z]\s)", " "),
"\sI'|\sI\s|^I'|^I\s|\sI(\.|\?|\!)|\sI$|\sA\s|^A\s|\.\.\.|\.|-|--|,|\?|\!|\.|\(|\)|'|"
"|:|;|\'|“|”|«|»|‘|’|<|>|\{|\}|\[|\]|#|#|\*|¦|\+|%|¬|&|\||¢|$|£|`|^|€|[0-9]|[0-9]+",
" ")))
From this example:
Multiple regex matches in Google Sheets formula
I am trying to write a regex to match the following at the beginning of a new line:
- a number followed by a closing parenthesis, e.g. 2) or 8)
- a number followed by a period, e.g. 5.
- the character '-'
- the character '*'
the following strings should match
"1. Sorting function. If you have a long checklist it's very difficult."
"5) This is another example"
"-this is yet another one"
"* last item in the list"
I have tried this but it doesn't quite get me what I am looking for.
re.findall(r'(?m)\s*^[-*(\d.)(\d\))]',item)
Try
re.findall(r'^\s*(\d+(\)|\.)|-|\*)', item, re.MULTILINE)
It will match all sequences of numbers followed by a closing parenthesis or period as well as dashes and stars at the beginning of the line.
Example: https://regex101.com/r/cR2lZ5/6
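A small runnable check against the four sample strings, as a sketch. It uses a slightly different pattern from the one above (an assumption of this example, not the answer's regex): a single capturing group, so re.findall returns plain strings rather than tuples, and [ \t]* instead of \s*, so leading whitespace cannot spill over a blank line.

```python
import re

# One capturing group, so findall returns the markers as plain strings.
# [ \t]* (instead of \s*) keeps the leading whitespace on the same line.
LIST_MARKER = re.compile(r'^[ \t]*(\d+[.)]|[-*])', re.MULTILINE)

item = "\n".join([
    "1. Sorting function. If you have a long checklist it's very difficult.",
    "5) This is another example",
    "-this is yet another one",
    "* last item in the list",
    "Plain line, no marker",
])

markers = LIST_MARKER.findall(item)
print(markers)  # ['1.', '5)', '-', '*']
```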
Assuming that your quote marks " are not included, and that each line is a separate string,
^\d\.|^\d\)|^\-|^\*
Would be the regular expression. | is OR, \d is a digit, and you escape the special characters ".", ")", "-", and "*" by putting a backslash in front of them.
You can test your regular expressions here. Good luck!