Regex: match expression except a specific string (no negative lookahead)

I'm trying to write a regex that matches most HTML elements, for example:
<script></script>
I would like to make an exception for the following HTML tag specifically:
<b>
I don't want to capture that tag. Is there a way to do it without using negative lookahead/lookbehind?
At the moment I have something like this:
((\%3C)|<)[^<b]((\%2F)|\/)*[^<\/b][a-z0-9\%\=\'\(\)\ ]+((\%3E)|>)
https://regex101.com/r/ZxkVMJ/2
It does work, but besides
<b>
it also fails to capture any single-character tag
(like <a>, for example)
as well as longer tags that start with b, for example
<balloon>
Thank you for any help

As a disclaimer, if any kind of XML/HTML parser is available to you, you should really use it for this problem. If you are forced to use regex here, then consider this pattern:
<([^b][^>]*|b[^>]+)>.*?<\/\1>
This matches an HTML element whose tag either starts with a character other than b, or starts with b but is then followed by one or more other characters (thus ruling out <b>). The back reference \1 then requires a closing tag with the same captured text.
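To check the pattern against the cases from the question, here is a minimal Python sketch. The sample strings and the name TAG_EXCEPT_B are made up for illustration; only the regex itself comes from the answer.

```python
import re

# Pattern from the answer: group 1 is the tag text, and </\1> must close it.
TAG_EXCEPT_B = re.compile(r"<([^b][^>]*|b[^>]+)>.*?</\1>")

# Hypothetical inputs covering the cases raised in the question.
samples = [
    "<script></script>",      # ordinary tag
    "<a>link</a>",            # one-character tag other than b
    "<balloon>up</balloon>",  # longer tag starting with b
    "<b>bold</b>",            # the tag that must be skipped
]

# Map each sample to whether the pattern finds a match in it.
results = {text: TAG_EXCEPT_B.search(text) is not None for text in samples}

for text, matched in results.items():
    print(f"{text!r}: {'match' if matched else 'no match'}")
```

Because the closing tag is matched through \1, an opening tag with attributes would need a different capture (for example, capturing only the tag name), so treat this as a starting point rather than a general HTML matcher.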

Related

Regular Expression - Starting and ending with, and contains specific string in the middle

I would like to generate a regex with the following condition:
The string "EVENT" is contained within a xml tag called "SHEM-HAKOVETZ".
For example, the following string should be a match:
<SHEM-HAKOVETZ>104000514813450EVENTS0001dfd0.DAT</SHEM-HAKOVETZ>
I think you want something like this: ^<SHEM-HAKOVETZ>.*EVENT.*<\/SHEM-HAKOVETZ>$
Regular expression
^<SHEM-HAKOVETZ>.*EVENT.*<\/SHEM-HAKOVETZ>$
Parts of the regular expression
^ From the beginning of the line
<SHEM-HAKOVETZ> Starting tag
.* Any character, zero or more times
EVENT Middle part
.* Any character, zero or more times
<\/SHEM-HAKOVETZ>$ Closing tag, at the end of the line
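As a quick check of the anchored pattern, the following Python sketch tests it against the sample line from the question and against a made-up line without EVENT. The variable names and the second sample are assumptions for illustration.

```python
import re

# Anchored pattern: the whole line must be one SHEM-HAKOVETZ element containing EVENT.
LINE_WITH_EVENT = re.compile(r"^<SHEM-HAKOVETZ>.*EVENT.*</SHEM-HAKOVETZ>$")

# Sample from the question, plus a hypothetical line that lacks EVENT.
with_event = "<SHEM-HAKOVETZ>104000514813450EVENTS0001dfd0.DAT</SHEM-HAKOVETZ>"
without_event = "<SHEM-HAKOVETZ>104000514813450.DAT</SHEM-HAKOVETZ>"

has_event = LINE_WITH_EVENT.match(with_event) is not None
lacks_event = LINE_WITH_EVENT.match(without_event) is None

print(has_event, lacks_event)
```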
If you want to match this line, you could use this regex:
<SHEM-HAKOVETZ>.*EVENTS.*(?=<\/SHEM-HAKOVETZ>)
However, I would not recommend using regex on XML-based data, because there may be problems with whitespace handling in XML. I would suggest using an actual XML parser (and then applying the regex to be sure about your results).
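The parser-based approach could look roughly like this in Python, using the standard library's xml.etree.ElementTree. The function name filename_contains_event is hypothetical, and the sketch assumes the element arrives as a standalone, well-formed XML fragment.

```python
import xml.etree.ElementTree as ET


def filename_contains_event(xml_fragment: str) -> bool:
    """Return True if the fragment is a SHEM-HAKOVETZ element whose text contains EVENT."""
    element = ET.fromstring(xml_fragment)  # raises ParseError on malformed XML
    text = element.text or ""              # element.text is None for an empty element
    return element.tag == "SHEM-HAKOVETZ" and "EVENT" in text


# Sample from the question, plus a hypothetical element without EVENT.
sample = "<SHEM-HAKOVETZ>104000514813450EVENTS0001dfd0.DAT</SHEM-HAKOVETZ>"
other = "<SHEM-HAKOVETZ>104000514813450.DAT</SHEM-HAKOVETZ>"

print(filename_contains_event(sample))
print(filename_contains_event(other))
```

The parser takes care of whitespace and entity handling, so the string test only has to look at the element's text.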
Here is a solution to only match the "value" part ignoring the XML tags:
(?<=<SHEM-HAKOVETZ>)(?:.*EVENTS.*)(?=<\/SHEM-HAKOVETZ>)
You can check it out in action at: https://regex101.com/r/4XiRch/1
It uses lookbehind and lookahead to make sure it only matches when the tags are correct, while the match itself contains only the content, which is convenient for further processing.
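A short Python sketch of the lookaround version, to show that the full match excludes the tags. The names CONTENT_ONLY and content are illustrative; the lookbehind here is fixed-width, which Python's re module requires.

```python
import re

# Lookbehind and lookahead assert the tags without including them in the match.
CONTENT_ONLY = re.compile(r"(?<=<SHEM-HAKOVETZ>)(?:.*EVENTS.*)(?=</SHEM-HAKOVETZ>)")

sample = "<SHEM-HAKOVETZ>104000514813450EVENTS0001dfd0.DAT</SHEM-HAKOVETZ>"

match = CONTENT_ONLY.search(sample)
content = match.group(0) if match else None

print(content)
```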

Can regex alternatives (a|b|c) work together with the end of the pattern?

I have a regex pattern that I'm using to try and match anything wrapped in an <a>, <em>, or quote ".
(?:<a.*?>|<em>|")(.*?)(?:"|<\/em>|<\/a>)
However, what I'd like to do is force the opening and closing delimiters to pair up: an <a> with </a>, an <em> with </em>, and so on. What I don't want is a match that starts with an <a> but ends with a ".
For example:
<a href='google.com'>"Google"</a>
Should return Google (and probably also "Google", but that's not a big deal). However, at the moment, it's returning href='google.com'> as a match (and completely ignoring "Google"), since that text starts and ends with "correct" patterns.
You can see all the ways this particular pattern breaks here on Regex101.
So is there a way to tell regex that if it starts a match with <a> that it must finish with </a> (and the same for the other patterns)?
You want a back reference:
<(a|em|")[^>]*>(.*?)(?:</\1>)
See live demo.
Your target is in group 2 (there's no avoiding capturing the tag as group 1 if you use a back reference).
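Here is a small Python sketch of the back-reference idea, using the pattern from the answer unchanged. The helper inner_text and the test strings other than the Google example are made up for illustration.

```python
import re

# \1 repeats whatever group 1 matched, so <a ...> can only be closed by </a>.
PAIRED = re.compile(r"""<(a|em|")[^>]*>(.*?)(?:</\1>)""")


def inner_text(html: str):
    """Return group 2 (the wrapped content) of the first match, or None."""
    match = PAIRED.search(html)
    return match.group(2) if match else None


link = inner_text("<a href='google.com'>\"Google\"</a>")
emphasis = inner_text("<em>stress</em>")
mismatched = inner_text("<a href='x'>oops</em>")  # opening and closing tags differ

print(link, emphasis, mismatched)
```

Note that, as written, the " alternative is also wrapped in <...> and </...>, so bare quotation marks would still need a separate alternative if they matter.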

Regex Full match

I'm trying to understand regular expressions:
I need to only match on text_01 and text_02 and filter out the tags.
<span>text_01<b>text_02</b>
I've tried to do it like:
(?<=<span>)(([^>]+)<b>)(.+?)(?=</b>)
But it captures 3 groups, and the full match includes a tag:
text_01<b>text_02
Could you give me advice on how I need to build a regex whose Full match contains only text and no tags?
Parsing HTML with regular expressions can get very complicated. In general it is not advised practice and better to use a parser for this (some library in whatever language you are using).
But for cases where you are sure the text content does not have < nor >, and these < and > are not nested, you could use this one:
[^<>]*(?=<[^<>]*>)
This only matches text that is followed by a pair of < and >.
If it is enough to test that text is followed by <, it can be simply:
[^<>]*(?=<)
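A sketch of how the first pattern behaves in Python. Because [^<>]* can match the empty string, re.findall also returns empty matches, so this example filters them out; the variable names are illustrative.

```python
import re

# Text runs without < or > that are followed by a complete <...> tag.
TEXT_BEFORE_TAG = re.compile(r"[^<>]*(?=<[^<>]*>)")

html = "<span>text_01<b>text_02</b>"

# Drop the empty matches that the * quantifier allows at tag boundaries.
texts = [chunk for chunk in TEXT_BEFORE_TAG.findall(html) if chunk]

print(texts)
```

Each returned item is a full match on its own, which is what the question asked for; using + instead of * would avoid the empty matches without the filter.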
By using a non-capturing group you can exclude the middle <b> tag from the capture groups, but you will never get a full match that leaves the tag out. That is not possible: a regular expression cannot skip a part while capturing, because a match must be contiguous.
(?<=<span>)(.+?)(?:<b>)(.+?)(?=<\/b>)
Full match text_01<b>text_02
Group 1. text_01
Group 2. text_02
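The same result can be reproduced in Python with a minimal sketch like the one below; the variable names are made up, and the pattern is the one from the answer.

```python
import re

# Lookarounds keep <span> and </b> out of the match; (?:<b>) is matched but not captured.
PARTS = re.compile(r"(?<=<span>)(.+?)(?:<b>)(.+?)(?=</b>)")

match = PARTS.search("<span>text_01<b>text_02</b>")

full_match = match.group(0)  # still contains the <b> tag
first_text = match.group(1)
second_text = match.group(2)

print(full_match, first_text, second_text)
```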

What is the regex to match a single element in a group of almost equivalent elements?

In the following content:
<page1 ...>
...
</page>
<page2 ...>
...
</page>
<page3 ...>
...
<queue>...</queue>
...
</page>
How do you find the match just for the very last element (the one that contains the queue tag)?
I have tried
(?s)<page.*?<queue>.*?</page>
But that matches the ENTIRE content. I've been trying to play around with lookaheads, but can't figure it out.
You can use the following monster for your particular use case:
<page(?:[^/]+/(?!page))+queue>(?:[^/]+|/(?!page))+/page>
I'm not sure this is the best example for learning regex, and it is definitely not a good idea to parse XML this way in real life, but it is possible. Don't forget to escape / as \/ in languages that delimit regular expressions with a /.../ construct.
See technical explanation at http://regex101.com/r/qZ0yR1/2.
The logic is as follows:
1. <page.../queue>.../page> - get the content of a page element that contains the end tag for queue
2. [^/]+/(?!page) - match all text up to the next closing tag, but make sure that it is not a closing tag for page
3. (?:[^/]+/(?!page))+queue> - repeat the above match as many times as needed until the closing tag is for queue
4. (?:[^/]+|/(?!page))+/page> - then repeat as many times as needed until the closing tag is for page (I used | as a shortcut for (?:[^/]+/(?!page))+[^/]+/page>, because the expression in point 2 only matches the text if the following closing tag is not for page, but here we need to match exactly that text at the end)
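To see the steps above in action, here is a Python sketch run against a small made-up document shaped like the one in the question. The attribute values and element contents are assumptions, and the sample deliberately contains no / outside closing tags, since the pattern relies on that.

```python
import re

# The pattern from the answer; [^/] also matches newlines, so no DOTALL flag is needed.
PAGE_WITH_QUEUE = re.compile(r"<page(?:[^/]+/(?!page))+queue>(?:[^/]+|/(?!page))+/page>")

# Hypothetical document: only the third page element contains a queue element.
doc = (
    "<page1 id='1'>\nfirst\n</page>\n"
    "<page2 id='2'>\nsecond\n</page>\n"
    "<page3 id='3'>\n<queue>jobs</queue>\n</page>"
)

match = PAGE_WITH_QUEUE.search(doc)
found = match.group(0) if match else None

print(found)
```

Attempts starting at page1 and page2 fail because the first closing tag reached is </page>, which the negative lookahead rejects before queue> can be matched.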
You could use this pattern:
(?:<page[^>]*>(?:(?!<queue>).)*?<\/page>)|(<page[^>]*>.*?<\/page>)
The idea here is to first consume the page elements that do not contain queue, then consume and capture the ones that do.
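A Python sketch of this consume-then-capture approach, assuming the DOTALL flag so that . crosses newlines. The sample document and variable names are made up for illustration.

```python
import re

# First alternative: page elements without <queue> (matched, not captured).
# Second alternative: any remaining page element (captured as group 1).
CONSUME_OR_CAPTURE = re.compile(
    r"(?:<page[^>]*>(?:(?!<queue>).)*?</page>)|(<page[^>]*>.*?</page>)",
    re.DOTALL,
)

# Hypothetical document: only the third page element contains a queue element.
doc = (
    "<page1 id='1'>\nfirst\n</page>\n"
    "<page2 id='2'>\nsecond\n</page>\n"
    "<page3 id='3'>\n<queue>jobs</queue>\n</page>"
)

# Keep only the matches where the capturing alternative was used.
captured = [m.group(1) for m in CONSUME_OR_CAPTURE.finditer(doc) if m.group(1) is not None]

print(captured)
```

The caller has to look at group 1 rather than the full match, because the uninteresting page elements still produce matches.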
This is the most succinct I could muster:
<page(.(?!page))*<queue.*<\/page>
You need the DOTALL flag set, and the whole match is your target.
See demo
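A brief Python sketch of this shorter pattern with DOTALL set, on a made-up document. Note the trailing .* is greedy, so the match runs to the last </page> in the input; in this sample the queue element sits in the final page, as in the question.

```python
import re

# (.(?!page))* refuses to step onto a character that is directly followed by "page".
SUCCINCT = re.compile(r"<page(.(?!page))*<queue.*</page>", re.DOTALL)

# Hypothetical document: only the last page element contains a queue element.
doc = (
    "<page1 id='1'>\nfirst\n</page>\n"
    "<page2 id='2'>\nsecond\n</page>\n"
    "<page3 id='3'>\n<queue>jobs</queue>\n</page>"
)

match = SUCCINCT.search(doc)
whole_match = match.group(0) if match else None

print(whole_match)
```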
You can use a greedy match (.*) to match everything up to the last tag.
Here's an example (excuse the Java):
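import java.util.regex.Matcher;
import java.util.regex.Pattern;

final String str = "<page1 foo='bar'>apple</page> <page2 foo='bar'>orange</page> <page3 foo='bar'>pear</page>";
// The leading greedy .* consumes as much as possible, so the group lands on the last page element.
final Pattern p = Pattern.compile(".*<page[^>]+>(\\w+)</page>$");
final Matcher matcher = p.matcher(str);
if (matcher.find()) {
    // Prints pear
    System.out.println(matcher.group(1));
}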
Also, +1 for 'why pick regex'; regex is not a good fit for this problem.
Assuming the tag might not be 'queue' and could be anything else, try the following:
(?<=[>]).*(?=\<\/[\w]+\>([\n]?)(.*[\n])?\<\/page\>$)
example here:
http://regex101.com/r/sN6aC5/1
This uses a lookahead to find the last closing tag </...> that is followed by anything and then by a closing page tag </page> at the end of the string. Then, using a lookbehind, it matches everything between this final closing tag and the first > before it (which should be the end of the last opening tag).

Delphi XE2 Regex: Quantifier does not work inside positive lookbehind?

I have a complete HTML document string from a web page containing this BASE tag:
<BASE href="http://whatreallyhappened.com/">
In Delphi XE2, I use this regular expression with the whole HTML document as subject to get the URL from the BASE tag between the double quotes:
BaseURL := TRegEx.Match(HTMLDocStr, '(?<=<base(\s)href=").*(?=")', [roIgnoreCase]).Value;
This works, but only if there is only ONE space character in the subject between BASE and href.
I tried to add a quantifier to the space part in the regex (\s), but it did not work.
So how can I make this regex match the URL even if there are several spaces between BASE and href?
You're making this far too complicated by using lookaround. If you want to extract only part of the regex match, simply add a capturing group. Then you can use the text matched by the capturing group instead of the overall match. In most cases you'll also get much better performance this way.
To find the base tag in a file and extract its URL you can use the regex <base[^>]+href=["']([^"']*)["']. Call TRegex.Match() to get a TMatch. This has a Groups property that you can use to retrieve group 1 if a match was found.
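The capturing-group approach is engine-independent. As an illustration, here is a minimal sketch in Python rather than Delphi (the names BASE_HREF and base_url, and the HTML sample with extra spaces, are assumptions for the example).

```python
import re

# Group 1 captures the URL; [^>]+ absorbs any whitespace or attributes before href.
BASE_HREF = re.compile(r"""<base[^>]+href=["']([^"']*)["']""", re.IGNORECASE)

# Hypothetical document with several spaces between BASE and href.
html = '<html><head><BASE   href="http://whatreallyhappened.com/"></head></html>'

match = BASE_HREF.search(html)
base_url = match.group(1) if match else None

print(base_url)
```

In Delphi the equivalent step would be reading group 1 from the TMatch returned by TRegEx.Match, as the answer describes.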
With lookaround
You can try moving the quantified whitespace out of the lookbehind, for example:
(?<=<BASE)\s+href=".*(?=")
(?<=<BASE)\s{0,30}href=".*(?=")
Working demo
Without lookaround
By the way, if you just want the content of href, there is no need for lookaround; you can simply use:
<BASE\s+href="(.*?)"
Working demo
EDIT: after reading your comments I figured out a workaround (ugly, but it could work). You can try using something like this:
((?<=<BASE\shref=")|(?<=<BASE\s\shref=")|(?<=<BASE\s\s\shref=")).*(?=")
Notice the \s, \s\s and \s\s\s in the three lookbehinds: one alternative per number of spaces.
I know that this is horrible, but if none of the above works you can try that.
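To show what this workaround does and where it stops, here is a Python sketch (Python's re module also requires each lookbehind to be fixed-width, so the same trick applies there). The helper base_url and the test tags are made up for illustration.

```python
import re

# One fixed-width lookbehind per supported number of whitespace characters (1 to 3).
BASE_URL = re.compile(
    r'((?<=<BASE\shref=")|(?<=<BASE\s\shref=")|(?<=<BASE\s\s\shref=")).*(?=")',
    re.IGNORECASE,
)


def base_url(tag: str):
    """Return the URL between the quotes, or None if no lookbehind alternative fits."""
    match = BASE_URL.search(tag)
    return match.group(0) if match else None


one_space = base_url('<BASE href="http://whatreallyhappened.com/">')
three_spaces = base_url('<BASE   href="http://whatreallyhappened.com/">')
four_spaces = base_url('<BASE    href="http://whatreallyhappened.com/">')  # not covered

print(one_space, three_spaces, four_spaces)
```

Every extra space needs another alternative, which is why the capturing-group versions earlier in the thread scale better.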