I've been doing a lot of finding and replacing in Sublime Text and decided I needed to learn RegEx. So far, so good. I'm no expert by any means, but I'm learning quickly.
The trouble is knowing how to replace open-ended HTML matches.
For example, I wanted to find all <button>s that didn't have a role attribute.
After some hacking and searching, I came up with the following pattern (see in action):
<button(?![^>]+role).*?>(.*?)
Great! Except, in the code base I'm working in, there are tons of results.
How do I go about replacing the results safely by injecting role="button" at the end of the opening <button tag, just before its closing >?
Desired results
Before: <button type="button">
After: <button type="button" role="button">
Before: <button class="btn-lg" type="button">
After: <button class="btn-lg" type="button" role="button">
You can capture everything before the ending > and put it back, before the insertion of role=button:
<(button(?![^>]+role).*?)>
This captures everything in the tag.
Replace by:
<$1 role="button">
The $1 contains what the first regex captured.
See the updated regexr.
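As a rough sketch of the same find-and-replace outside Sublime Text, the snippet below applies the pattern with Python's re module. The function name add_role and the sample strings are made up for illustration, and Python writes the backreference as \1 where Sublime Text uses $1.

```python
import re

# Pattern from the answer: capture everything inside a <button ...> opening
# tag, but only when the text "role" does not appear later in that tag.
PATTERN = re.compile(r'<(button(?![^>]+role).*?)>')


def add_role(html: str) -> str:
    """Append role="button" to every matching opening tag (hypothetical helper)."""
    return PATTERN.sub(r'<\1 role="button">', html)


samples = [
    '<button type="button">',
    '<button class="btn-lg" type="button">',
    '<button type="button" role="button">',  # already has a role: left alone
]
results = [add_role(s) for s in samples]
for before, after in zip(samples, results):
    print(before, "->", after)
```

One caveat worth checking before a bulk replace: the lookahead rejects any tag that contains the letters "role" anywhere before its closing >, for example inside a class name, so such buttons would be skipped.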
Related
I'm kinda lost with Regex and would appreciate some help.
Target: To extract the URL between the two " ", without returning the " themselves.
Base string:
<span class="fa fa-eye fa-fw poptip" data-toggle="tooltip" title="" data-original-title="Inspect in-game"></span>
I came up with the following solution:
(="(.*)" class="btn btn-xs btn-default ")
Too bad it is matching
="somerandomurl" class="btn btn-xs btn-default "
Is it possible to match only the inner result, without the delimiters?
somerandomurl
Since this should be included in a script that should run as fast as possible, maybe there is a faster and better approach? In reality this regex search will be applied on a complete website.
Using RegEx to match markup is usually not a good idea. If you have the option, you might prefer an HTML/DOM parser.
That said, your RegEx should match the sample in most languages. It defines two sets of parentheses, though, so the result you want is located in group 2. Groups 0 and 1 both hold the full match, because the outer parentheses wrap the entire pattern.
If you have trouble reading the correct result group, please provide some additional information, such as which language you're working in and preferably a code snippet.
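To illustrate the group numbering, here is a minimal sketch in Python. The question does not show the markup that actually contains the URL, so the anchor tag below is a made-up sample; only the pattern itself comes from the question.

```python
import re

# Hypothetical sample: the real markup containing the URL is not shown in
# the question, so this anchor tag is invented for illustration.
html = '<a href="somerandomurl" class="btn btn-xs btn-default ">Inspect</a>'

match = re.search(r'(="(.*)" class="btn btn-xs btn-default ")', html)

full_match = match.group(0)  # the whole match
outer = match.group(1)       # outer parentheses: same text as group 0
url = match.group(2)         # inner parentheses: only the URL

print(url)
```

Dropping the outer parentheses would make the URL group 1 instead, which is probably the simpler form if nothing else needs the full match.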
I have a text like this:
22 <a data-event="event:noted:tasks" class="btn btn-default show-if-closed" title="Noted Tasks">
25 <a data-event="event:until-today" class="btn btn-default show-if-closed" title="Until Today">
28 <a data-event="event:until-one-week" class="btn btn-default show-if-closed" title="Until One Week">
31 <a data-event="event:until-one-month" class="btn btn-default show-if-closed" title="Until One Month">
Now I want to replace the entire text except the string inside the title attribute.
After replacing the text I would like to get lines like this:
Noted Tasks
Until Today
Until One Week
Until One Month
What Regex-Pattern do I need to match the text except the title-Values? The pattern should be universal, not limited to a-Tags
Use the following regex:
^.*?title="([^"]*)".*$
and replace with \1. In that way, the entire line is replaced with the desired info.
Test here.
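For reference, the same replacement might look like this in Python. This is only a sketch: the two sample lines are copied from the question, and re.MULTILINE is needed so that ^ and $ anchor at every line rather than only at the ends of the whole text.

```python
import re

text = (
    '22 <a data-event="event:noted:tasks" class="btn btn-default show-if-closed" title="Noted Tasks">\n'
    '25 <a data-event="event:until-today" class="btn btn-default show-if-closed" title="Until Today">'
)

# Each line is replaced by the contents of its title="..." attribute.
# "." does not cross newlines here, so every line is handled separately.
titles = re.sub(r'^.*?title="([^"]*)".*$', r'\1', text, flags=re.MULTILINE)
print(titles)
```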
Please note that it is better to use proper HTML parsers to... ahem... parse HTML.
"The pattern should be universal, not limited to a-Tags"
Considering that the word title can appear anywhere on a web page (normal text, class names, keywords...), only a dedicated HTML parser will help you in the long run.
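As one possible illustration of the parser route, the sketch below uses html.parser from Python's standard library to collect title attributes from any tag. The class name TitleCollector and the sample markup are assumptions made for this example, not part of the original question.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collects the value of every title attribute, whatever the tag."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current opening tag.
        for name, value in attrs:
            if name == "title" and value is not None:
                self.titles.append(value)


collector = TitleCollector()
collector.feed(
    '<p class="title">title="not an attribute"</p>'
    '<a data-event="event:until-today" title="Until Today">x</a>'
    '<span title="Noted Tasks"></span>'
)
print(collector.titles)
```

The word title inside the class name and inside the paragraph text is ignored here, which is the case a line-based regex cannot reliably tell apart.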
I have a bunch of html-files that I concat and want to get the actual contents only.
However, I'm having some trouble finding the correct regex for that. Basically, I'm trying to remove everything before, in between, and after certain boundaries. It's somewhat similar to "Regular expression to match a line that doesn't contain a word?", though I feel it is more complex. I'm having no luck.
Source-Data:
Stuff I dont need before
<div id="start">
blablabla11
blablabla12
<div id="end">
Stuff I dont need in the middle1
<div id="start">
blablabla21
blablabla22
<div id="end">
Stuff I dont need in the middle2
<div id="start">
blablabla31
blablabla32
<div id="end">
Stuff I dont need in the end
Desired result:
<div id="start">
blablabla11
blablabla12
<div id="end">
<div id="start">
blablabla21
blablabla22
<div id="end">
<div id="start">
blablabla31
blablabla32
<div id="end">
Context:
I'm working in Sublime (Mac) -> Perl Regex
My current approach is based on inverse matching / regex lookarounds (I know there is lots of discussion about wording, methods, ugliness etc. around this topic, but I can't afford to care as I need to get the job done):
Find: (?s)^((?!(<div id="start">)(?s)(.*?)(<div id="end">)).)*$
Replace: $3
And many more variants, I've been testing and playing around.
However, it yields:
blablabla11
blablabla12
<div id="start">
blablabla21
blablabla22
<div id="start">
blablabla31
blablabla32
<div id="start">
Nice, but not there yet. And whatever I'm trying I'm stumbling into other problems. Noob at work I guess.
Thanks a gazillion for your help guys!
Chris
EDIT:
Thank you for the first answers! However, I must admit that my minimal example is a bit misleading (because it is too easy). In reality I am facing hundreds of complex and diverse html-files concatenated into one single large file.
The only common bits are that the content of every html-file starts with a known string (here simplified as <div id="start">) and ends with a known string (here simplified as <div id="end">). The content itself obviously has loads of different tags etc., so just testing for opening and closing tags sadly won't cut it.
You may look for
(?s).*?(<div id="start">.*?<div id="end">)(?:(?:(?!<div id="start">).)*$)?
and replace with $1\n\n. See regex demo.
Details
(?s) - DOTALL modifier, . now matches any char
.*? - any 0+ chars, as few as possible
(<div id="start">.*?<div id="end">) - Group 1: <div id="start">, any 0+ chars as few as possible, and <div id="end">
(?:(?:(?!<div id="start">).)*$)? - an optional non-capturing group matching 1 or 0 occurrence of
(?:(?!<div id="start">).)* - any char, 0 or more occurrences, that does not start a <div id="start"> char sequence (aka tempered greedy token)
$ - end of string.
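A quick way to sanity-check this pattern outside the editor is a short Python sketch like the one below. The source string is a shortened copy of the sample data from the question, and the replacement is written \1\n\n because Python uses \1 where Sublime Text uses $1.

```python
import re

source = (
    'Stuff I dont need before\n'
    '<div id="start">\nblablabla11\nblablabla12\n<div id="end">\n'
    'Stuff I dont need in the middle1\n'
    '<div id="start">\nblablabla21\nblablabla22\n<div id="end">\n'
    'Stuff I dont need in the end\n'
)

pattern = re.compile(
    r'(?s).*?(<div id="start">.*?<div id="end">)'  # junk, then Group 1
    r'(?:(?:(?!<div id="start">).)*$)?'            # optional trailing junk
)

# Keep only Group 1 of each match, followed by a blank line.
cleaned = pattern.sub(r'\1\n\n', source)
print(cleaned)
```

The optional tail only succeeds after the last block, since that is the only place where the tempered token can reach the end of the string without running into another <div id="start">.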
When I woke up this morning, I didn’t know a stroke of regex. By the time I went to Mass, I’d been able to cobble together this regex to find occurrences of ‘Mph’ in an html document.
(?i)(?<=[\s|\d])mph+
If I run it against the following test data:
<div class="vsMph">
<p>95 Mph</p>
</div>
<div class="vsMph">
<p>95Mph</p>
</div>
It correctly matches:
‘ Mph’ and
‘Mph’
And equally correctly leaves the ‘vsMph’ alone, which is exactly what I want. Eventually, I'm going to use the same technique to match knots, ft, in, km and so on.
I’m executing this expression in Sublime Text 3 using RegReplace and ultimately, what I hope to do is to use this regular expression to find all occurrences of ‘Mph’ preceded by a space or a digit and:
1. Enclose ‘Mph’ in <abbr> tags.
2. Add a space between the digit and the opening <abbr> tag if there was no space between the last digit and 'Mph' originally.
In other words, I want to convert the above test data to:
<div class="vsMph">
<p>95 <abbr title="Miles per hour">Mph</abbr></p>
</div>
<div class="vsMph">
<p>95 <abbr title="Miles per hour">Mph</abbr></p>
</div>
I can get RegReplace to add the <abbr> tags as described in 1. above, but I’ve searched around on Google and I can’t find anything that tells me how to conditionally insert a space in a regex replace.
So I’m wondering. Is it possible in the first place to conditionally add a space in a regex replacement and if so how do I do it, or do I have to search for ‘\sMph’ and ‘\dMph’ and replace them separately?
Regards.
I would suggest using groups to match Mph. You could search for simply the following regex:
(\d)(\s)?(Mph)
Then replace using groups
$1 <abbr title="Miles per hour">$3</abbr>
output:
<div class="vsMph">
<p>95 <abbr title="Miles per hour">Mph</abbr></p>
</div>
<div class="vsMph">
<p>95 <abbr title="Miles per hour">Mph</abbr></p>
</div>
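The same idea can be tried in Python as a rough sketch. The test data is taken from the question, and \1 and \3 are Python's spelling of $1 and $3. The optional whitespace in group 2 is simply dropped, and the replacement always writes exactly one space.

```python
import re

html = (
    '<div class="vsMph">\n<p>95 Mph</p>\n</div>\n'
    '<div class="vsMph">\n<p>95Mph</p>\n</div>'
)

# Group 1: the last digit, group 2: optional whitespace, group 3: the unit.
result = re.sub(
    r'(\d)(\s)?(Mph)',
    r'\1 <abbr title="Miles per hour">\3</abbr>',
    html,
)
print(result)
```

Because the pattern requires a digit before the unit, the class name vsMph is left untouched.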
Yes I know, don't parse html with regex. That said:
I am trying to capture content between any tag with the word "Title" in the first tag.
I started with:
(?P<QUALIFY_TITLE><(.*?)(title)(.*?)>)(.*?)?(?<CAPTURE>KnownTermIWant)(.*?)(\<\/.*?>)
Where the Named Group Capture is a known word/string I am looking for. I also capture for research sake the QUALIFY_TITLE Name group. I do this because I don't want the string/term unless I 'qualify' it in this way.
However, if I have part of an html that looks like this:
<div class="wwm"><div class="inbox"><input name="language-id" type="hidden" id="language-id" value="" /><input name="widget-page-handle" type="hidden" id="widget-page-handle" value="wwm4widget_post" /><input name="email-page-handle" type="hidden" id="email-page-handle" value="wwm4widget_emailpopup" /><div id="divWidget" style="display: block;" class="vhWidget"> <div id="divShareLink" style="display: block;" class="shareLink"><div id="divTitle" class="title">KnownTermIWant</title>
Although I get the CAPTURE string I want (KnownTermIWant), the QUALIFY_TITLE string starts from the very first "<".
I am trying to have QUALIFY_TITLE start from the last "<" before the title, not the first. In other words, QUALIFY_TITLE should be:
<div id="divTitle
or even
<div id="divTitle" class="title">
but I am currently getting
<div class="wwm"><div class="inbox"><input name="language-id" type="hidden" id="language-id" value="" /><input name="widget-page-handle" type="hidden" id="widget-page-handle" value="wwm4widget_post" /><input name="email-page-handle" type="hidden" id="email-page-handle" value="wwm4widget_emailpopup" /><div id="divWidget" style="display: block;" class="vhWidget"> <div id="divShareLink" style="display: block;" class="shareLink"><div id="divTitle" class="title"
The problem is that a regex-search will try to match at the first possible opportunity, and non-greedy quantifiers (*? instead of *) do not affect whether something is a match. For example, given the string abcd, the regex .*?d will match the whole thing, because .*? will still match as much as it needs to in order to ensure that the regex matches.
Do you see what I mean?
So you need to make your subexpressions more precise; for example, instead of <(.*?)(title)(.*?)>, you should write <([^>]*)(title)([^>]*)>.
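The difference is easy to see in a small Python sketch. The HTML here is a shortened, made-up version of the markup in the question; the two patterns are the ones discussed above.

```python
import re

# A lazy quantifier still starts at the first possible position and then
# grows until the rest of the pattern can match.
lazy_abcd = re.search(r'.*?d', 'abcd').group(0)

html = ('<div class="wwm"><div class="inbox">'
        '<div id="divTitle" class="title">KnownTermIWant</title>')

# ".*?" may run across ">" characters, so the match begins at the first "<".
lazy = re.search(r'<(.*?)(title)(.*?)>', html).group(0)

# "[^>]*" cannot leave the current tag, so earlier tags fail to match.
precise = re.search(r'<([^>]*)(title)([^>]*)>', html).group(0)

print(lazy_abcd)
print(lazy)
print(precise)
```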
The problem
There's only one problem here: you are matching exactly what you've asked for :)
The process
If you want to match only the last tag, ask yourself this question:
"What is inside every preceding tag, but not inside the one I want?"
The conclusion
The answer is the open/close tags themselves:
(?P<QUALIFY_TITLE><([^<>]*?)(title)(.*?)>)(.*?)?(?<CAPTURE>KnownTermIWant)(.*?)(\<\/.*?>)
                    ^^^^^
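Here is a hedged sketch of that pattern in Python. Two things are assumptions of this example rather than part of the answer: the (?<CAPTURE>...) group is rewritten as (?P<CAPTURE>...) because Python's re module only accepts the (?P<name>...) spelling, and the HTML is a shortened version of the sample from the question.

```python
import re

html = ('<div class="wwm"><div class="inbox">'
        '<div id="divTitle" class="title">KnownTermIWant</title>')

pattern = re.compile(
    r'(?P<QUALIFY_TITLE><([^<>]*?)(title)(.*?)>)'  # [^<>] keeps us inside one tag
    r'(.*?)?(?P<CAPTURE>KnownTermIWant)(.*?)(</.*?>)'
)

match = pattern.search(html)
qualify = match.group('QUALIFY_TITLE')
capture = match.group('CAPTURE')
print(qualify)
print(capture)
```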
Your code was quite a big mess, but I'm going to answer the question in the title, in a much more simplified way:
In this sample code:
<div>Example text<div>Foo bar</div> Hello world <div>Lorem ipsum</div></div> hi
if you want to match from the first <div> to the last </div>, you could just use a greedy quantifier, such as + or *:
/<div>(.*)<\/div>/
That will match everything from the first <div> up to the very last </div>.
Demo
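To compare the greedy quantifier with its lazy counterpart, a minimal Python sketch on the sample string above could look like this (the variable names are arbitrary):

```python
import re

html = ('<div>Example text<div>Foo bar</div> Hello world '
        '<div>Lorem ipsum</div></div> hi')

# Greedy: .* grabs as much as possible, so the match ends at the last </div>.
greedy = re.search(r'<div>(.*)</div>', html).group(0)

# Lazy: .*? stops at the first </div> it can reach.
lazy = re.search(r'<div>(.*?)</div>', html).group(0)

print(greedy)
print(lazy)
```

Neither version understands nesting; the greedy one only happens to line up with the outermost pair in this particular string.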
If this doesn't answer your question, be aware that the regular expression gets complicated very quickly (each extra requirement makes it far more complex), so, as you said in your very first line, just use a parser.