Regex to capture everything before pattern in Google Sheets - regex

I’m having a hard time figuring out the regex code in Google Sheets to check a cell then return everything including new lines \n and returns \r before a certain pattern \*+.
A little more background: I'm using REGEXEXTRACT(A:A,"...") format inside a bigger ArrayFormula so that it automatically updates when a new row is added. This one’s working properly. It’s only the regex part I’m having trouble with.
So, for the purpose of this question, let's say I'm only worried about extracting the data from the A1 cell before a certain pattern and return that value in cell B1. Which brings us to this code in cell B1:
REGEXEXTRACT(A1,"...")
For example, this is what my A1 cell looks like:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus accumsan risus id ex dapibus sodales.
Curabitur dui lacus, tincidunt vel ligula quis, volutpat mattis eros.
In quis metus at ex auctor lobortis. Aliquam sed nisi purus. Sed cursus odio erat, ut tristique sapien interdum interdum. Morbi vel sollicitudin ante, non pellentesque libero.
***********
Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Aenean egestas urna facilisis massa posuere, quis accumsan erat ornare.
Curabitur at dapibus nibh. Nam nec vestibulum ligula. Phasellus bibendum mi urna, ac hendrerit libero interdum non. Suspendisse semper non elit aliquam auctor.
Morbi vel sem tortor. Donec a sapien quis erat condimentum consequat in ut sem. Quisque in tellus sed est lobortis ultricies sed vitae enim.
I want to return this value in B1:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus accumsan risus id ex dapibus sodales.
Curabitur dui lacus, tincidunt vel ligula quis, volutpat mattis eros.
In quis metus at ex auctor lobortis. Aliquam sed nisi purus. Sed cursus odio erat, ut tristique sapien interdum interdum. Morbi vel sollicitudin ante, non pellentesque libero.
Which is basically anything before the pattern *******. In Python, I can add the re.DOTALL to the .* but I can't get this to work in Google Sheets.

To make a dot match line breaks, add (?s) to the pattern. To match any char, use a dot (.). To match only up to the leftmost occurrence, use a lazy quantifier, *?. To actually extract the substring you need, wrap the part of the pattern you are interested in with capturing parentheses.
So, to match up to the first ******* substring, you may use
(?s)^(.*?)\*\*\*\*\*\*\*
or (?s)^(.*?)\*{7}. See the regex demo (note that Go's regex engine uses the same RE2 syntax, so you can test your patterns at regex101.com with the Golang flavor selected).
(?s) - a DOTALL modifier
^ - start of string
(.*?) - Group 1: any 0+ chars as few as possible
\*\*\*\*\*\*\* - 7 literal asterisk symbols.
Note that you cannot rely on a negated character class (which does match line breaks) if the text before the separator may itself contain * chars; that is, ^([^*]*)\*\*\*\*\*\*\* won't work in those cases.
If you just want to match any chars up to the first * in the string, your regex will simplify greatly to
^([^*]+)
It matches
^ - start of string
([^*]+) - Capturing group 1: one or more chars other than *.
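
To see the difference between the two patterns side by side, here is a minimal sketch in Python. Python's re module is not RE2, so this only stands in for Google Sheets; the constructs used here ((?s), lazy quantifiers, negated classes) should behave the same way in both. The sample text and the function names are made up for illustration.

```python
import re

# Hypothetical cell content: a stray "*" appears before the separator line.
cell = "line one\nline * two\n*******\nline three"

def before_marker(text):
    """Return everything before the first run of seven asterisks."""
    m = re.search(r"(?s)^(.*?)\*{7}", text)
    return m.group(1) if m else None

def before_first_star(text):
    """Return everything before the very first asterisk."""
    m = re.search(r"^([^*]+)", text)
    return m.group(1) if m else None

print(repr(before_marker(cell)))      # stops at the separator line
print(repr(before_first_star(cell)))  # stops at the stray "*"
```

The lazy pattern keeps the stray asterisk in the result, while the negated class cuts the text off at it.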

The re.DOTALL flag in Python corresponds to the (?s) single-line mode flag in re2.
Python:
(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
re2:
Flags: s let . match \n (default false)
So,
=REGEXEXTRACT(A1,"(?s)(.*?)\*")
This corresponds to re.findall()
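
As a rough illustration of that equivalence on the Python side, the sketch below runs the same pattern three ways: without any flag, with the inline (?s), and with flags=re.DOTALL. The sample string is invented for the example.

```python
import re

# Hypothetical multi-line value standing in for cell A1.
cell = "first line\nsecond line\n*****\nafter"

# Without DOTALL the dot stops at "\n", so the first match is the empty
# string right before the first "*".
without_flag = re.findall(r"(.*?)\*", cell)[0]

# Inline (?s) and the re.DOTALL constant are two spellings of the same mode.
with_inline = re.findall(r"(?s)(.*?)\*", cell)[0]
with_constant = re.findall(r"(.*?)\*", cell, flags=re.DOTALL)[0]

print(repr(without_flag))
print(repr(with_inline))
print(with_inline == with_constant)
```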

Not regex, though it might suit someone who wants the same result and is less particular about the method:
=ArrayFormula(LEFT(A1:A,Find("***********",A1:A)-3))

If you really only want to match everything before the first *:
=REGEXEXTRACT(A1;"[^*]*")
If you want to allow a single star in the text and only stop at multiple (2 or more) stars (possibly divided by newlines) at the beginning of a line, you could try:
=REGEXEXTRACT(A1;"(?s)^(.*)\n(\*\n?){2,}")
But you would have to strip the stars. E.g.
=REGEXREPLACE(REGEXEXTRACT(A1;"(?s)^(.*)\n(\*\n?){2,}"); "\n(\*\n?){2,}"; "")
Lookaheads are not supported in Google Sheets, because its RE2 regex engine has no lookaround assertions.
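
A small Python sketch of the "two or more stars at the start of a line" idea follows. It uses a non-capturing group, (?:...), for the separator so that only the text before it is captured. Whether this removes the need for the REGEXREPLACE step in Sheets is an assumption that has not been tested here; the sample text is hypothetical.

```python
import re

# Hypothetical cell content: one stray "*" in the text, then a separator line.
cell = "intro with a single * star\nmore text\n***\ntrailing text"

# (?s) lets the dot cross line breaks; (?:\*\n?){2,} requires at least two
# stars (optionally split by newlines) right after a line break.
pattern = re.compile(r"(?s)^(.*)\n(?:\*\n?){2,}")

match = pattern.search(cell)
before = match.group(1) if match else None
print(repr(before))
```

Because (.*) is greedy, a text with several separator lines would be cut at the last one rather than the first.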

Related

Regex Match multiple lines and provide condition for end of match

So I have this text that I am trying to parse with Regex:
Name: Test Data 1
Description: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec feugiat nulla id nisi venenatis blandit.
Donec blandit egestas orci, at tristique dui vehicula in. Maecenas fringilla fringilla enim, in pulvinar ex gravida
in. Nam cursus facilisis ante, sed tristique nisl sagittis sed. In auctor felis id neque suscipit ullamcorper. Nunc
faucibus elit sed metus vestibulum, ullamcorper pulvinar nisi auctor. Praesent sodales orci mauris, eget dapibus
mauris sodales in. Ut iaculis, ante vitae ullamcorper semper, metus tortor auctor purus, eu convallis nulla lacus
in tellus. Phasellus feugiat tempus neque, in fringilla nisi scelerisque sed. Donec elementum diam nec mattis dignissim.
I am trying to parse it to load it into a database.
With this expression, I am trying to get a match on the "Name" and "Description" parameters but also trying to get a match on the parameter value as well (which can sometimes be multi-line).
(.*):\s(.*)
I have been searching for a while now and I cannot seem to be able to make it match the whole paragraph but stop when it hits a blank line.
I would like the result to be as follows:
1st Match
Group 1: Name
Group 2: Test Data 1
2nd Match
Group 1: Description
Group 2: Description value with multi-line
https://regex101.com/r/mG2ms9/3
Thanks
You can use the following:
(.*?):\s([\s\S]*?)(?=\n(?:\n|\w|$))
Here it is on regex101.
[\s\S] matches any character, even a new line (whereas '.' does not, by default).
Then we're matching as few characters as possible (*?) up until the point where the next line is either blank (\n), starts with a word character (\w), or is the end of the string ($).
We can get away with the \w option since all of the new lines in the description parameter are followed by a space. If this isn't always the case, you could replace \w with something like .*: to check instead if the next line contains ':' and stop if so.
Note that I disabled multi-line mode; it's not suitable here.
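
For anyone loading this into a database from a script, here is a minimal Python sketch of the same pattern with re.findall. It assumes, as the answer does, that continuation lines of a value start with a space and that the text ends with a newline; the sample record is shortened and invented.

```python
import re

# Hypothetical record: continuation lines are indented by one space.
record = (
    "Name: Test Data 1\n"
    "Description: line one\n"
    " line two\n"
    " line three\n"
)

# Multi-line mode is deliberately left off, so $ only matches at the end.
FIELD = re.compile(r"(.*?):\s([\s\S]*?)(?=\n(?:\n|\w|$))")

pairs = FIELD.findall(record)
for key, value in pairs:
    print(key, "->", repr(value))
```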

Regex select all anchors except some

I need to remove all anchors (anchor text remains) from the string except those anchors that have href="/"
This is example text:
Fusce imperdiet nulla ut sapien aliquet, congue varius dui consectetur. This link remains et blandit nisl. Curabitur euismod volutpat urna, eget dignissim libero cursus rhoncus. Nulla ac test sollicitudin link from this text should be removed. Maecenas sodales vel lorem eu placerat.
Here is regex that I think should work (using negative lookahead):
/<a.*?(?!href=["']\/["'])>(.*?)</a>/gi
Yet it selects both anchors.
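Try the regex <a(?![^>]*href=["']\/["'])[^>]*>(.*?)<\/a>
The negative lookahead (?![^>]*href=["']\/["']), placed right after <a, rejects any tag whose own attributes contain href="/". In your pattern the lookahead sits just before > and only checks the text at that position, so it never sees the href attribute. Using [^>]* instead of .* keeps the check inside the current tag, so an href="/" on a later anchor cannot affect an earlier one.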
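
Here is a hedged Python sketch of the replacement step. It uses a variant of the lookahead that is limited to the tag's own attributes via [^>]*, and it assumes simple, well-formed anchors with no ">" inside attribute values. The function name and sample HTML are made up for illustration.

```python
import re

# Match <a ...>text</a> unless the opening tag itself has href="/" or href='/'.
ANCHOR = re.compile(
    r"""<a\b(?![^>]*href=["']/["'])[^>]*>(.*?)</a>""",
    re.IGNORECASE | re.DOTALL,
)

def strip_anchors(html):
    """Replace every matched anchor with its inner text."""
    return ANCHOR.sub(r"\1", html)

sample = 'Go <a href="/">home</a> or <a href="/x">away</a>.'
print(strip_anchors(sample))
```

For anything beyond simple markup, an HTML parser is usually a safer tool than a regex.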

sed: How to substitute within a match?

Using regexr, I wrote the expression /[\.!?] [A-Z]/g to match sentences using 3 assumptions:
Sentences end with punctuation: [.,!?] (I'm not sure how to match double punctuation marks or combinations...)
One or more spaces always follow the punctuation mark.
The next sentence begins with a CAPITAL letter. (True 99% of the time, except for lowercase nouns such as iDevices)
Using sed, I'd like to take these matches, and substitute the space(s) with a \n character. I can do an after match $' and a before match $`, but how can I replace within a match?
If there is a better way of splitting texts into one sentence per line, I'm open to alternatives.
No bashisms: for Linux, OS X, and BSD
Input:
Vivamus fermentum semper porta. Nunc diam velit, adipiscing ut
tristique vitae, sagittis vel odio. Maecenas convallis ullamcorper
ultricies. Curabitur ornare, ligula semper consectetur sagittis, nisi
diam iaculis velit, id fringilla sem nunc vel mi.
Output:
Vivamus fermentum semper porta.
Nunc diam velit, adipiscing ut tristique vitae, sagittis vel odio.
Maecenas convallis ullamcorper ultricies.
Curabitur ornare, ligula semper consectetur sagittis, nisi diam iaculis velit, id fringilla sem nunc vel mi.
You can use this replacement:
sed 's/\([.!?][.!?]*\) *\([A-Z]\)/\1\n\2/g;' file
\(...\) delimits a capture group and \1 is a reference to the captured content.
The OSX version of sed doesn't interpret \n as a newline; you must instead use the sequence \1'$'\n\\2 as the replacement string.
A more POSIX-compliant way is to write:
sed 's/\([.!?][.!?]*\) *\([A-Z]\)/\1\
\2/g;' file
with an escaped newline, as suggested by @cliffordheath.
Note that the dot doesn't need to be escaped inside a character class.
You need to use capture groups with \( and \) to re-insert the punctuation and initial letter. This example allows a following sentence to start with any alphanumeric (but requires at least one space to avoid messing up decimal numbers):
$ sed -e 's/\([.!?]\) *\([[:alnum:]]\)/\1\
\2/g'
foo. bat! baz? foo, bar.
foo.
bat!
baz?
foo, bar.
I hope this helps.
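
If a scripting language is an acceptable alternative to sed, the same capture-group idea can be sketched in Python. Unlike the sed commands above, this version first joins the wrapped lines into one, which is an added step to reproduce the desired output; the function name is hypothetical.

```python
import re

def one_sentence_per_line(text):
    """Join wrapped lines, then break after sentence-ending punctuation."""
    flat = " ".join(text.split())  # collapse newlines and repeated spaces
    # \1 is the punctuation run, \2 the capital letter starting the next sentence.
    return re.sub(r"([.!?]+) +([A-Z])", r"\1\n\2", flat)

wrapped = (
    "Vivamus fermentum semper porta. Nunc diam velit, adipiscing ut\n"
    "tristique vitae, sagittis vel odio. Maecenas convallis ullamcorper\n"
    "ultricies."
)
print(one_sentence_per_line(wrapped))
```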

Simple email address check using preg_match

I have a simple regular expression for finding email addresses in a text, but even though I don't see an error, it doesn't work.
$addr=array();
$t='Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean fermentum risus id tortor. Morbi leo mi, nonummy eget tristique non, rhoncus non leo. Donec quis nibh at felis congue commodo. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos hymenaeos aaa@bbb.com. Aliquam ccc@ddd.net ornare wisi eu metus.';
if(preg_match_all('~[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}~',$t,$addr, PREG_SET_ORDER)){
echo 'found';
}
I have also tried this version I found, but it didn't work either:
if(preg_match_all('/^[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z]{2,4}$/',$t,$addr, PREG_SET_ORDER)){
Your pattern only matches emails that are all uppercase. You need to either use [A-Za-z] or add the case-insensitive flag (i) to the pattern.
This would also work for the match pattern and is shortest:
[0-z.%-]+@[0-z.-]+\.[A-z]{2,4}
This works because 0-z covers A-Z, a-z, 0-9 and _, and A-z covers A-Z and a-z. Note that both ranges also take in the ASCII punctuation that lies between those blocks (0-z includes : ; < = > ? @ [ \ ] ^ and `), so the pattern is more permissive than it looks.
I develop in ruby and you can see this 'in action' at http://rubular.com/r/PdbH1BjWMs
Include the lower case letter class:
if(preg_match_all('~[a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}~',$t,$addr, PREG_SET_ORDER)){
echo 'found';
}
...note a-z.
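
The case-sensitivity point can be checked quickly outside PHP. The sketch below is Python, not PHP, and assumes ordinary @-separated addresses; it compares the uppercase-only pattern with and without a case-insensitive flag. The sample text is shortened from the question.

```python
import re

text = "per inceptos hymenaeos aaa@bbb.com. Aliquam ccc@ddd.net ornare wisi."

PATTERN = r"[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}"

# Uppercase-only classes find nothing in lowercase text.
case_sensitive = re.findall(PATTERN, text)

# The same pattern with the case-insensitive flag finds both addresses.
case_insensitive = re.findall(PATTERN, text, flags=re.IGNORECASE)

print(case_sensitive)
print(case_insensitive)
```

In PHP the corresponding change would be the i modifier after the closing delimiter.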

regex: match until the first ] character is found

I'm trying to match until the first occurrence of ] is found but can't seem to make it work. Could someone help me figure this out?
The string I'm matching against:
[plugin:tabs][tab title="test"]Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nam sit amet nisl nisl. Ut interdum libero vitae quam ultricies et lacinia elit aliquet. Praesent tincidunt, sem tempus feugiat feugiat, turpis tellus scelerisque erat, sit amet feugiat neque arcu ac lectus. Sed at mi et elit interdum scelerisque vitae eu felis.[/tab][/plugin]
What it should match:
[plugin:tabs]
What it keeps matching:
[plugin:tabs][tab title="test"]
The regex:
(\[plugin:(?<identifier>[^\s]+)(?<parameters>.*?)\])
EDIT:
What it should also match:
[plugin:tabs test="test"]
You just need to add ? like so (lazy match, will match as few characters as possible):
(\[plugin:(?<identifier>[^\s]+?)(?<parameters>.*?)\])
                              ^
Although the (?<parameters>.*?) part is unnecessary then.
So your final Regex would look like this:
(\[plugin:(?<identifier>[^\s]+?)\])
Edit: See @stema's answer.
Try this:
(\[plugin:(?<identifier>[^\]\s]+)(?<parameters>.*?)\])
See it here on Regexr
This excludes the ] character, in addition to whitespace characters, from the first named group.
If you don't need the first capturing group you can make it a non-capturing group by adding ?: right after the opening parenthesis.
(?:\[plugin:(?<identifier>[^\]\s]+)(?<parameters>.*?)\])
To keep the space in between from being captured by the second group, just match optional whitespace between the two groups:
(?:\[plugin:(?<identifier>[^\]\s]+)\s*(?<parameters>.*?)\])
See it here on Regexr
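
The final pattern can be tried against both sample strings with the Python sketch below. Python spells named groups as (?P<name>...) rather than (?<name>...), so the pattern is adapted accordingly; the helper name is hypothetical.

```python
import re

# Identifier stops at whitespace or "]"; parameters are whatever follows,
# up to the first "]".
PLUGIN = re.compile(
    r"\[plugin:(?P<identifier>[^\]\s]+)\s*(?P<parameters>.*?)\]"
)

def parse_plugin(text):
    """Return (whole tag, identifier, parameters) for the first plugin tag."""
    m = PLUGIN.search(text)
    if m is None:
        return None
    return m.group(0), m.group("identifier"), m.group("parameters")

print(parse_plugin('[plugin:tabs][tab title="test"]Lorem ipsum[/tab][/plugin]'))
print(parse_plugin('[plugin:tabs test="test"]'))
```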
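With any language that supports lookbehinds, this may be your easiest solution. It lazily matches from the start of the string and stops as soon as the previous character is a ]:
/^.*?(?<=\])/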