I need to remove all anchors from a string (keeping the anchor text), except those anchors that have href="/".
This is example text:
Fusce imperdiet nulla ut sapien aliquet, congue varius dui consectetur. This link remains et blandit nisl. Curabitur euismod volutpat urna, eget dignissim libero cursus rhoncus. Nulla ac test sollicitudin link from this text should be removed. Maecenas sodales vel lorem eu placerat.
Here is regex that I think should work (using negative lookahead):
/<a.*?(?!href=["']\/["'])>(.*?)</a>/gi
Yet it selects both anchors.
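Try this regex: <a(?!.*href=["']\/["']).*?>(.*?)<\/a>
The negative lookahead (?!.*href=["']\/["']) rejects any <a tag that is followed by href="/", so that anchor is left alone. Note that .* can look past the end of the current tag, so an href="/" in a later anchor on the same line also blocks the match; using [^>]* in place of .* keeps the check inside the tag.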
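The lookahead approach above can be sketched in Python as follows. This is a minimal sketch, not the answerer's code: the sample markup is hypothetical (the HTML of the original example was not preserved), `strip_anchors` is an illustrative name, and the lookahead uses `[^>]*` so that it only inspects the current tag.

```python
import re

# Hypothetical sample markup; the original post's HTML was not preserved.
html = (
    'Fusce imperdiet <a href="/">This link remains</a> et blandit nisl. '
    'Nulla ac <a href="/test">test</a> sollicitudin.'
)

# The lookahead is limited to the current tag with [^>]* so that an
# href="/" in a later anchor cannot affect the decision for this one.
ANCHOR = re.compile(
    r'<a\b(?![^>]*\bhref=["\']/["\'])[^>]*>(.*?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def strip_anchors(text: str) -> str:
    """Replace every anchor, except those with href="/", by its inner text."""
    return ANCHOR.sub(r"\1", text)


print(strip_anchors(html))
```

For anything beyond simple, well-formed snippets, an HTML parser is usually a safer choice than a regex.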
So I have this text that I am trying to parse with Regex:
Name: Test Data 1
Description: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec feugiat nulla id nisi venenatis blandit.
Donec blandit egestas orci, at tristique dui vehicula in. Maecenas fringilla fringilla enim, in pulvinar ex gravida
in. Nam cursus facilisis ante, sed tristique nisl sagittis sed. In auctor felis id neque suscipit ullamcorper. Nunc
faucibus elit sed metus vestibulum, ullamcorper pulvinar nisi auctor. Praesent sodales orci mauris, eget dapibus
mauris sodales in. Ut iaculis, ante vitae ullamcorper semper, metus tortor auctor purus, eu convallis nulla lacus
in tellus. Phasellus feugiat tempus neque, in fringilla nisi scelerisque sed. Donec elementum diam nec mattis dignissim.
I am trying to parse it to load it into a database.
With this expression, I am trying to get a match on the "Name" and "Description" parameters but also trying to get a match on the parameter value as well (which can sometimes be multi-line).
(.*):\s(.*)
I have been searching for a while now and I cannot seem to be able to make it match the whole paragraph but stop when it hits a blank line.
I would like the result to be as follows:
1st Match
Group 1: Name
Group 2: Test Data 1
2nd Match
Group 1: Description
Group 2: Description value with multi-line
https://regex101.com/r/mG2ms9/3
Thanks
You can use the following:
(.*?):\s([\s\S]*?)(?=\n(?:\n|\w|$))
Here it is on regex101.
[\s\S] matches any character, even a new line (whereas '.' does not, by default).
Then we're matching as few characters as possible (*?) up until the point where the next line is either blank (\n), starts with a word character (\w), or is the end of the string ($).
We can get away with the \w option since all of the new lines in the description parameter are followed by a space. If this isn't always the case, you could replace \w with something like .*: to check instead if the next line contains ':' and stop if so.
Note that I disabled multi-line mode; it's not suitable here.
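As a rough illustration, the same pattern can be tried in Python. The sample text below is hypothetical and assumes, as the answer does, that continuation lines are indented and that records are separated by a blank line; collapsing the whitespace at the end is an extra step that is not part of the answer.

```python
import re

# Hypothetical sample: continuation lines start with spaces,
# and a blank line separates the records.
text = (
    "Name: Test Data 1\n"
    "Description: Lorem ipsum dolor sit amet,\n"
    "  consectetur adipiscing elit.\n"
    "  Donec feugiat nulla.\n"
    "\n"
    "Name: Test Data 2\n"
    "Description: Short value\n"
)

# No re.MULTILINE here, so $ only matches at the end of the string.
FIELD = re.compile(r"(.*?):\s([\s\S]*?)(?=\n(?:\n|\w|$))")

# Collapse line breaks and indentation inside each value.
pairs = [(key, " ".join(value.split())) for key, value in FIELD.findall(text)]

for key, value in pairs:
    print(f"{key!r}: {value!r}")
```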
I’m having a hard time figuring out the regex in Google Sheets to check a cell and return everything, including newlines (\n) and carriage returns (\r), before a certain pattern, \*+.
A little more background: I'm using REGEXEXTRACT(A:A,"...") format inside a bigger ArrayFormula so that it automatically updates when a new row is added. This one’s working properly. It’s only the regex part I’m having trouble with.
So, for the purpose of this question, let's say I'm only worried about extracting the data from the A1 cell before a certain pattern and return that value in cell B1. Which brings us to this code in cell B1:
REGEXEXTRACT(A1,"...")
For example, this is what my A1 cell looks like:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus accumsan risus id ex dapibus sodales.
Curabitur dui lacus, tincidunt vel ligula quis, volutpat mattis eros.
In quis metus at ex auctor lobortis. Aliquam sed nisi purus. Sed cursus odio erat, ut tristique sapien interdum interdum. Morbi vel sollicitudin ante, non pellentesque libero.
***********
Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Aenean egestas urna facilisis massa posuere, quis accumsan erat ornare.
Curabitur at dapibus nibh. Nam nec vestibulum ligula. Phasellus bibendum mi urna, ac hendrerit libero interdum non. Suspendisse semper non elit aliquam auctor.
Morbi vel sem tortor. Donec a sapien quis erat condimentum consequat in ut sem. Quisque in tellus sed est lobortis ultricies sed vitae enim.
I want to return this value in B1:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus accumsan risus id ex dapibus sodales.
Curabitur dui lacus, tincidunt vel ligula quis, volutpat mattis eros.
In quis metus at ex auctor lobortis. Aliquam sed nisi purus. Sed cursus odio erat, ut tristique sapien interdum interdum. Morbi vel sollicitudin ante, non pellentesque libero.
Which is basically anything before the pattern *******. In Python, I can add the re.DOTALL to the .* but I can't get this to work in Google Sheets.
To make a dot match line breaks, add (?s) to the pattern. To match any character, use a dot (.). To match up to the leftmost occurrence, use a lazy quantifier, *?. To actually extract the substring you need, wrap the part of the pattern you are interested in with capturing parentheses.
So, to match up to the first ******* substring, you may use
(?s)^(.*?)\*\*\*\*\*\*\*
or (?s)^(.*?)\*{7}. See the regex demo (note that the Go regex engine also uses RE2 syntax, so you can test your patterns with the Golang flavor at regex101.com).
(?s) - a DOTALL modifier
^ - start of string
(.*?) - Group 1: any 0+ chars as few as possible
\*\*\*\*\*\*\* - 7 literal asterisk symbols.
Note you cannot rely on a negated character class (that matches line breaks) if your substring may contain * chars, that is, ^([^*]*)\*\*\*\*\*\*\* won't work in those cases.
If you just want to match any chars up to the first * in the string, your regex will simplify greatly to
^([^*]+)
It matches
^ - start of string
([^*]+) - Capturing group 1: one or more chars other than *.
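The difference between the two patterns can be checked outside Sheets. The following is a small sketch using Python's `re` module, which is not RE2 but treats these particular constructs the same way; the cell content is a hypothetical stand-in that contains a stray `*` before the separator.

```python
import re

# Hypothetical cell content with a single * before the separator line.
cell = (
    "First paragraph, with a single * inside.\n"
    "Second paragraph.\n"
    "***********\n"
    "Text after the separator.\n"
)

# (?s) lets . match newlines; the lazy (.*?) stops at the first run
# of seven asterisks, so the stray * is kept.
before_separator = re.search(r"(?s)^(.*?)\*{7}", cell).group(1)

# The negated class stops at the very first *, wherever it is.
before_first_star = re.search(r"^([^*]+)", cell).group(1)

print(repr(before_separator))
print(repr(before_first_star))
```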
re.DOTALL flag in python corresponds to (?s) single line mode flag in re2.
Python:
(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
re2:
Flags: s let . match \n (default false)
So,
=REGEXEXTRACT(A1,"(?s)(.*?)\*")
As with re.findall(), when the pattern contains a capturing group, the group's content is returned rather than the whole match.
This is not regex, though it might suit someone who wants the same result and is less particular about the method:
=ArrayFormula(LEFT(A1:A,Find("***********",A1:A)-3))
If you really only want to match everything before the first *:
=REGEXEXTRACT(A1;"[^*]*")
If you want to allow a single star in the text and only stop at multiple (2 or more) stars (possibly divided by newlines) at the beginning of a line, you could try:
=REGEXEXTRACT(A1;"(?s)^(.*)\n(\*\n?){2,}")
But you would have to strip the stars. E.g.
=REGEXREPLACE(REGEXEXTRACT(A1;"(?s)^(.*)\n(\*\n?){2,}"); "\n(\*\n?){2,}"; "")
Lookaheads are not supported in Google Sheets, because its regex functions use RE2.
I want to search for a string pattern in a line and if found replace the whole line with the matched string pattern.
My string pattern starts with 2 alpha characters followed by either 5 or 6 numeric characters, e.g. HR12345 or HR123456.
Here is a sample of what the lines with the pattern look like.
Class cum accumsan. In. Pellentesque nec magna interdum fusce metus, massa aliquam HR032145
Amet commodo arcu, felis orci Per. Facilisis blandit rhoncus hac porttitor ut duis eu HR32145
Mattis quis magna, suspendisse HR32146 aucibus vel, fames Nonummy molestie penatibus ad.
Nascetur mattis ad egestas et nec HR032111 Penatibus posuere. Posuere.
Inceptos consectetuer neque nullam HR032114. rutrum Eleifend.
Netus tortor conubia parturient sapien interdum adipiscing sociis luctus integer HR032113
HR032112 Mattis erat a ante. Rutrum. Mattis risus fames. Euismod sapien morbi habitasse.
Platea sapien vitae Risus. Erat dictum elit dapibus convallis.
Facilisis ut dis morbi integer fusce dolor Et class Primis iaculis.
Aptent per risus phasellus HR032188
After the search and replace, it should look like this:
HR032145
HR32145
HR32146
HR032111
HR032114
HR032113
HR032112
Platea sapien vitae Risus. Erat dictum elit dapibus convallis.
Facilisis ut dis morbi integer fusce dolor Et class Primis iaculis.
HR032188
Try the following simple find and replace:
Find:
^.*(HR\d+).*$
Replace:
$1
This replacement will only happen with lines containing HR followed by one or more digits. Hence, the lines which do not have this pattern will not even match, and no replacement will take place there.
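For readers who want to script the same replacement, here is a minimal Python sketch. It is an assumption-laden variant rather than the answer's exact pattern: it follows the question's stricter description (any two letters followed by 5 or 6 digits, as a whole word) instead of HR\d+, and it uses `re.MULTILINE` so that ^ and $ apply per line. Only a few of the sample lines are reproduced.

```python
import re

# A few of the sample lines from the question.
lines = (
    "Class cum accumsan. In. Pellentesque nec magna interdum fusce metus, massa aliquam HR032145\n"
    "Mattis quis magna, suspendisse HR32146 aucibus vel, fames Nonummy molestie penatibus ad.\n"
    "Inceptos consectetuer neque nullam HR032114. rutrum Eleifend.\n"
    "Platea sapien vitae Risus. Erat dictum elit dapibus convallis.\n"
)

# Two letters followed by 5 or 6 digits, bounded by \b so that longer
# digit runs are not accepted. re.MULTILINE makes ^ and $ match per line.
CODE_LINE = re.compile(r"^.*\b([A-Za-z]{2}\d{5,6})\b.*$", re.MULTILINE)

# Lines without a code do not match and are left untouched.
result = CODE_LINE.sub(r"\1", lines)
print(result)
```

Because the leading .* is greedy, a line that contains several codes keeps the last one.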
I have a certain amount of content like this:
<p><strong>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut ullamcorper enim ut nulla fringilla, non elementum nunc dapibus. Donec porta a lorem in vestibulum. Aenean viverra vulputate finibus. Sed malesuada nibh vitae enim luctus, at placerat diam vehicula.</strong></p>
<p>Quisque eu nisl sed tellus congue aliquet ac id risus. Etiam eget nisi ac lectus cursus suscipit. Mauris a dictum justo. Aliquam eget mi vel nunc imperdiet ultricies.</p>
<iframe width="480" height="270" frameborder="0" src="https://www.youtube.com/embed/EgqUJOudrcM" allowfullscreen="" ></iframe>
All I am trying to do is get the YouTube video ID.
So far, I have come up with the following Regular Expression:
/<iframe.*src=["\'].*youtube\.com\/embed\/(.*)["\'] ?>/
This works if the src attribute is the last attribute in the tag, otherwise it doesn't. How can my regular expression be written so as to overcome this?
Works in this case
But not in this one
As you can see, in the second example, my Regex also matches the attribute after src. I know why this happens, I just can't work out how to prevent it.
I'm certainly no Regex expert, so any suggestions to improve what I currently have are welcome.
With this one:
<iframe.*?src=".*?youtube\.com\/embed\/(\w+)
The lazy .*? avoids matching too much and stops at the first src attribute.
It then matches the URL directly.
Edit: you just want the ID, not the full URL.
You can use the following regex:
<iframe[^>]*src=\"[^\"]+\/([^\"]+)\"[^>]*>
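Both answers rely on keeping the match inside the tag and inside the attribute value. A hedged Python sketch of that idea follows; the pattern is a variant written for this example rather than either answer's exact regex, and `[\w-]+` is used for the ID because YouTube video IDs can contain hyphens and underscores.

```python
import re

# The iframe from the question, with src followed by another attribute.
html = (
    '<p>Quisque eu nisl sed tellus congue aliquet.</p>\n'
    '<iframe width="480" height="270" frameborder="0" '
    'src="https://www.youtube.com/embed/EgqUJOudrcM" allowfullscreen="" ></iframe>'
)

# [^>]* cannot leave the tag and [^"']* cannot leave the attribute value,
# so attributes that follow src are never swallowed.
EMBED_ID = re.compile(
    r'<iframe\b[^>]*\bsrc=["\'][^"\']*youtube\.com/embed/([\w-]+)',
    re.IGNORECASE,
)

match = EMBED_ID.search(html)
video_id = match.group(1) if match else None
print(video_id)
```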
Is there a regex to match "all characters including newlines"?
For example, in the regex below, there is no output from $2 because (.+?) doesn't include new lines when matching.
$string = "START Curabitur mollis, dolor ut rutrum consequat, arcu nisl ultrices diam, adipiscing aliquam ipsum metus id velit. Aenean vestibulum gravida felis, quis bibendum nisl euismod ut.
Nunc at orci sed quam pharetra congue. Nulla a justo vitae diam eleifend dictum. Maecenas egestas ipsum elementum dui sollicitudin tempus. Donec bibendum cursus nisi, vitae convallis ante ornare a. Curabitur libero lorem, semper sit amet cursus at, cursus id purus. Cras varius metus eu diam vulputate vel elementum mauris tempor.
Morbi tristique interdum libero, eu pulvinar elit fringilla vel. Curabitur fringilla bibendum urna, ullamcorper placerat quam fermentum id. Nunc aliquam, nunc sit amet bibendum lacinia, magna massa auctor enim, nec dictum sapien eros in arcu.
Pellentesque viverra ullamcorper lectus, a facilisis ipsum tempus et. Nulla mi enim, interdum at imperdiet eget, bibendum nec END";
$string =~ /(START)(.+?)(END)/;
print $2;
If you don't want to add the /s regex modifier (perhaps you still want . to retain its original meaning elsewhere in the regex), you may also use a character class. One possibility:
[\S\s]
This matches a character that is either not whitespace or is whitespace. In other words, any character.
You can also change modifiers locally in a small part of the regex, like so:
(?s:.)
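The same options exist in other flavors. As an illustration only, here is how they compare in Python, where `re.DOTALL` plays the role of Perl's /s and `re.MULTILINE` the role of /m; the sample text is a shortened, hypothetical stand-in for the string in the question.

```python
import re

text = "START first line\nsecond line\nthird line END"

# By default . does not match a newline, so this finds nothing.
plain = re.search(r"(START)(.+?)(END)", text)

# MULTILINE only changes where ^ and $ match; . still skips newlines.
multiline = re.search(r"(START)(.+?)(END)", text, re.MULTILINE)

# Three ways to let the middle group span lines.
dotall = re.search(r"(START)(.+?)(END)", text, re.DOTALL)
char_class = re.search(r"(START)([\s\S]+?)(END)", text)
scoped = re.search(r"(START)((?s:.)+?)(END)", text)

print(plain, multiline)
print(repr(dotall.group(2)))
```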
Add the s modifier to your regex to cause . to match newlines:
$string =~ /(START)(.+?)(END)/s;
Yep, you just need to make . match newlines:
$string =~ /(START)(.+?)(END)/s;
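Note that "multiline" (/m) is not what you want here: it only changes where ^ and $ match, and . still does not match a newline. Use /s instead:
$string =~ /(START)(.+?)(END)/s;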