Regex match between two regex expressions - regex

This has been driving me crazy, I can't find a solution that works!
I'm trying to do a regex between a couple of tags, bad idea I've heard but necessary this time :P
What I have at the start is a <body class="foo"> where foo can vary between files - <body.*?> search works fine to locate the only copy in each file.
At the end I have a <div id="bar">, bar doesn't change between files.
eg.
<body class="foo">
sometext
some more text
<maybe even some tags>
<div id="bar">
What I need to do is select everything between the two tags but not including them - everything between the closing > on body and the opening < on div - sometext to maybe even some tags.
I've tried a bunch of things, mostly variations on (?<=<body.*>)(.*?)(?=<div id="bar">) but I'm actually getting invalid expressions at worst on notepad++, http://regexpal.com/ and no matches at best.
Any help appreciated!

You are attempting to use a variable-length lookbehind, which most regular expression flavors, including Notepad++'s, do not support. I assume you are using Notepad++, so you can use the \K escape sequence instead.
<body[^>]*>\K.*?(?=<div id="bar">)
The \K escape sequence resets the starting point of the reported match and any previously consumed characters are no longer included. Make sure you have the . matches newline checkbox checked as well.
Alternatively, you can use a capturing group and avoid using lookaround assertions.
<body[^>]*>(.*?)<div id="bar">
Note: Using a capturing group, you can refer to group index "1" to get your match result.
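For readers working outside Notepad++, the capturing-group variant can be sketched in Python. This is only an illustration, not part of the original answer: the sample string html is made up from the question's example, and re.DOTALL stands in for the ". matches newline" checkbox. Python's standard re module does not support \K, which is why only the capturing-group form is shown.

```python
import re

# Hypothetical sample document, modelled on the example in the question.
html = '''<html>
<body class="foo">
sometext
some more text
<maybe even some tags>
<div id="bar">
</div>
</body>'''

# re.DOTALL lets "." match newlines, so the lazy (.*?) can span several lines.
pattern = re.compile(r'<body[^>]*>(.*?)<div id="bar">', re.DOTALL)

match = pattern.search(html)
# Group 1 holds everything between the closing > of <body ...>
# and the opening < of <div id="bar">.
between = match.group(1) if match else None
print(between)
```

The surrounding tags are consumed by the match but are not part of group 1, which is what the question asks for.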

Use the following pattern:
/<body[^>]*>(.*?)<div id="bar">/

Related

Vim Regex for html

Working with vim regexes for folding html, trying to ignore html tags that start and end on the same line.
So far, I have
if line =~# '<\(\w\+\).*<\/\1>'
return '='
endif
Which works fine for tags like <a></a>, but when dealing with custom elements, I run into issues since there is a hyphen in the tag name.
Like for example, this element
<paper-input label="Input label"></paper-input>
What needs to change in the regex to also catch the hyphen?
The correct regex (updated because of this link) is:
<\([^ >]\+\)[ >].*<\/\1>
or
<\([^ >]\+\)\>.*<\/\1>
The important part is [^ >]. It matches any character up to the next whitespace or >, so it matches both a and paper-input.
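The same idea can be checked outside Vim. The following is a rough Python translation of the first pattern, offered as a sketch only: Vim's \( \) and \+ become ( ) and +, and the helper name opens_and_closes_on_same_line is hypothetical.

```python
import re

# Python translation of the Vim pattern <\([^ >]\+\)[ >].*<\/\1>
# Group 1 is the tag name: everything up to the first space or ">".
same_line_tag = re.compile(r'<([^ >]+)[ >].*</\1>')

def opens_and_closes_on_same_line(line):
    """Return True if some tag is opened and closed on this line."""
    return same_line_tag.search(line) is not None

print(opens_and_closes_on_same_line('<a></a>'))
print(opens_and_closes_on_same_line('<paper-input label="Input label"></paper-input>'))
print(opens_and_closes_on_same_line('<div class="x">'))
```

Because the tag name is defined as "anything that is not a space or >", hyphenated custom element names need no special handling.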

Delphi XE2 Regex: Quantifier does not work inside positive lookbehind?

I have a complete HTML document string from a web page containing this BASE tag:
<BASE href="http://whatreallyhappened.com/">
In Delphi XE2, I use this regular expression with the whole HTML document as subject to get the URL from the BASE tag between the double quotes:
BaseURL := TRegEx.Match(HTMLDocStr, '(?<=<base(\s)href=").*(?=")', [roIgnoreCase]).Value;
This works, but only if there is only ONE space character in the subject between BASE and href.
I tried to add a quantifier to the space part in the regex (\s), but it did not work.
So how can I make this regex match the URL even if there are several spaces between BASE and href?
You're making this far too complicated by using lookaround. If you want to extract only part of the regex match, simply add a capturing group. Then you can use the text matched by the capturing group instead of the overall match. In most cases you'll also get much better performance this way.
To find the base tag in a file and extract its URL you can use the regex <base[^>]+href=["']([^"']*)["']. Call TRegex.Match() to get a TMatch. This has a Groups property that you can use to retrieve group 1 if a match was found.
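As an illustration of the capturing-group approach, here is a minimal Python sketch using the regex from this answer. It is not the Delphi code the question asks about: find_base_url is a hypothetical helper, and re.IGNORECASE plays the role of roIgnoreCase.

```python
import re

# Same regex as in the answer; group 1 captures the URL between the quotes.
base_href = re.compile(r'''<base[^>]+href=["']([^"']*)["']''', re.IGNORECASE)

def find_base_url(html):
    """Return the href of the first BASE tag, or None if there is none."""
    m = base_href.search(html)
    return m.group(1) if m else None

print(find_base_url('<BASE href="http://whatreallyhappened.com/">'))
print(find_base_url('<BASE     href="http://whatreallyhappened.com/">'))
```

Since [^>]+ accepts any run of characters inside the tag, the number of spaces between BASE and href no longer matters.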
With lookaround
You can keep the lookbehind fixed-length and apply the quantifier outside it, for example (note that the match then also includes the whitespace and href="):
(?<=<BASE)\s+href=".*(?=")
(?<=<BASE)\s{0,30}href=".*(?=")
Without lookaround
By the way, if you just want the content of href, there is no need for lookaround; you can simply use:
<BASE\s+href="(.*?)"
EDIT: after reading your comments I figured out a workaround (ugly but could work). You can try using something like this:
((?<=<BASE\shref=")|(?<=<BASE\s\shref=")|(?<=<BASE\s\s\shref=")).*(?=")
The three lookbehinds contain \s, \s\s and \s\s\s respectively, so each one is fixed-length and together they cover one to three spaces.
I know that this is horrible, but if none of the above works you can try that.

JSP Tag Spacing Regex

We are supposed to migrate all our apps from one type of server to another. The new servers do not accept invalid JSP tags where a space is not inserted between the attributes. For example, the following.
<input type="text"name="myField" />
The following regex was given to us to use, but it seems to not be perfect.
[\w.-]+[\s]*=[\s]*"[^"]+"[^\s/%>]
For example, it returns string assignments like the following.
span.style.fontWeight = "bold";
Can anyone suggest a better regex for locating just the invalid JSP code?
UPDATE
I want this regex to work using the Eclipse Search > File functionality.
Try this regex: (<.+?[^" ]+?="[^"]+?")([^ ]+?)(.+?>). It locates all "tags" with a closing " that is not followed by a space. You can then replace using the captured groups, $1 $2$3, to add the space.
Tenub's answer is nearly correct, but as Rachel G. mentioned, it will return false positives when the closing bracket immediately follows the closing quotation mark.
(<[^?%].+?[^" ]+?="[^"]+?")([^/ >]+?)([^>]*(?:/|\?|%)?>)
Should give you the results you're after.
Disclaimer: This is not a strict checker. You could have a tag such as <..." asdf/> go undetected, but as the tags are presumably well formed enough to work under the old system, this should be sufficient.
Simple version:
Find: (=\s*"[^"]*")(\w)
Replace with: $1 $2
Explanation
The find regex looks for = followed by optional whitespace followed by "...", immediately followed by a single alphanumeric character or underscore.
It's separated out into two capturing groups, which are represented by $1 and $2 in the replace expression - with a space inserted between them.
[Minor issue: this won't work for attribute values that include escaped double quotation marks. I haven't addressed this because I assume it is pretty unlikely. However, it justifies doing a manual find/replace rather than "replace all", just in case.]
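The find/replace pair from the simple version can be tried out with a short Python sketch. This is an illustration only: the original is meant for Eclipse's search dialog, add_missing_spaces is a hypothetical helper, and Python writes the replacement as \1 \2 rather than $1 $2.

```python
import re

# Group 1: "=" plus optional whitespace plus a quoted value.
# Group 2: the word character that follows the closing quote with no space.
missing_space = re.compile(r'(=\s*"[^"]*")(\w)')

def add_missing_spaces(line):
    """Insert a space between a quoted attribute value and the next attribute."""
    return missing_space.sub(r'\1 \2', line)

print(add_missing_spaces('<input type="text"name="myField" />'))
print(add_missing_spaces('span.style.fontWeight = "bold";'))
```

The second call uses the string assignment from the question: the closing quote there is followed by a semicolon rather than a word character, so the line is left unchanged.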

Matching all occurrences of a html element attribute in notepad++ regex

I have a file which has hundreds of links like this:
<h3>aspnet</h3>
Ex 1
Ex 2
Ex 3
So I want to remove all the elements
icon="..."
from all the lines. I went through the official Notepad++ regex wiki and have come up with this after several trials:
icon=\"[^\.]+\"
The problem with this is, it is selecting past the second double quote and stopping at the next occurring double quote. To illustrate, this will select the following content:
icon="data:image/png;base64,...jbvebich4sec9zgth1sfue1cdt...">EX 1</a> <a href="
If I modify the above regex to,
icon=\"[^\.]+\">
Then it is almost perfect, but it is also selecting the >:
icon="data:image/png;base64,...jbvebich4sec9zgth1sfue1cdt...">
The regex I am looking for would select like this:
icon="data:image/png;base64,...jbvebich4sec9zgth1sfue1cdt..."
I also tried the following, but it doesn't match anything at all
icon=\"[^\.]+\"$
Just match anything but a quote, followed by a quote:
icon="[^"]+"
Just tested with notepad++ 6.2.2 and confirmed that this matches correctly as written.
Broken down:
icon="
This is fairly obvious, match the literal text icon=".
[^"]+
This means to match any character that is not a ". Adding the + after it means "one or more times."
Finally we match another literal ".
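The breakdown above can be checked with a small Python sketch. The sample line is hypothetical: the links in the question are not shown in full, so the href and the shortened base64 payload here are made up for illustration.

```python
import re

# Literal icon=" then one or more non-quote characters, then the closing quote.
icon_attr = re.compile(r'icon="[^"]+"')

# Hypothetical input line with a shortened, made-up payload.
line = '<a href="http://example.com" icon="data:image/png;base64,AAAA">Ex 1</a>'

# Replacing the match with an empty string removes the whole attribute.
cleaned = icon_attr.sub('', line)
print(cleaned)
```

Because [^"]+ cannot cross a double quote, the match stops at the attribute's own closing quote instead of running on to a later one. The space that preceded icon= is left in place, since the pattern does not include it.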
I am not a Notepad++ user, so I don't know how Notepad++ handles regex, but can you try replacing
icon=\"[^>]* with an empty string?
Try this solution; I just checked that it works as you wanted.
To achieve your goal:
Find what: (icon.*")|.*?
Replace with: $1

Regex filter XAML tags with specific attribute

I have the following Regex that matches XAML tags:
<[^<>]*>
This would return both of these lines:
<Button x:Name="Button1" />
<Button x:Name="Button2" Content="foo" />
What I want to do is filter out tags that have "Content="foo"" in them.
There are similar examples out there, but in this case the quotation marks are tripping me up.
Any ideas?
Regex
/<[^>]*Content="foo".*?>/g
Which means: start the regex (/) and search for < followed by zero or more non-> characters ([^>]*), followed by Content="foo", followed by zero or more of any character, non-greedily (.*?), followed by > and the end of the regex (/).
Test Code:
Written in JavaScript, but it can be ported to other languages easily.
const x = ['<Button x:Name="Button1" /><Button x:Name="Button2" Content="foo" /><Button x:Name="Button1" /><Button Content="foo" />']
// Logs the two tags that contain Content="foo".
console.log( x[0].match(/<[^>]*Content="foo".*?>/g) )
Updated to match exact opposite (as required by OP)
Regex
Using a negative lookahead assertion:
<((?!Content="foo")[^>])*>
Which means: match < followed by zero or more non-> characters ([^>]), none of which may be the start of Content="foo", followed by >. The extra brackets are needed for grouping, so that the lookahead is checked before every character.
But remember - using regex as a substitute for XML parsing will make people on stack overflow get really angry with you... so better to post questions like this anonymously. ;)
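The negative-lookahead pattern above can be exercised with a short Python sketch. It is only an illustration: the sample string xaml is built from the two tags in the question, and finditer is used with group(0) because findall would return the contents of the capturing group rather than the whole tag.

```python
import re

# Before each non-">" character, assert that Content="foo" does not start here.
without_foo = re.compile(r'<((?!Content="foo")[^>])*>')

# Sample input made of the two tags from the question.
xaml = '<Button x:Name="Button1" /><Button x:Name="Button2" Content="foo" />'

# Keep only the full text of each matching tag.
kept = [m.group(0) for m in without_foo.finditer(xaml)]
print(kept)
```

The second tag is rejected because the lookahead fails at the C of Content, and the pattern cannot reach the closing > without consuming that character.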
<([^<>](?!\sContent\s*=\s*("foo"|'foo')))*>
This... But PLEASE, don't use Regexes to parse XML.
For each character we consume with [^<>], we check whether it is followed by a whitespace character and then Content="foo". If it is, the match fails at that point, so the engine backtracks by one character and checks for the terminating >.
http://regexr.com?2umr9
You have to click around a little so that the Regex is "activated".
Some small notes: this regex is WRONG because it's trying to parse XML, but ignoring that, it won't be fooled by SContent="foo" (ignored), Content='foo' (caught), :Content='foo' (ignored) or Content = "foo" (caught). This assumes you are using a regex engine that treats \s as the same set of whitespace characters as XML :-) Otherwise some "strange, alien" spaces could break it a little. Remember to use case-sensitive matching!